# Claim to support / refute:

#### - “The financial markets do not punish security breaches.”

https://public.tableau.com/profile/melvin7659#!/vizhome/DataBreach_Effect_on_Stocks/Sheet23?publish=yes

## Organizing & Cleaning the Data

In [2]:
# Imports

import pandas as pd
import xlrd as xl

In [2]:
# Load the Excel workbook
data = pd.ExcelFile("InfoisBeautiful-DataBreaches.xlsx")

In [3]:
# See sheet names in excel file
data.sheet_names

['2017 update', 'Jan 2015 update', 'July 2013 update (old)']

In [4]:
# Load first sheet into dataframe

DF = data.parse('2017 update')

In [5]:
# Preview data columns
DF.columns

Index(['Entity', 'alternative name', 'story', 'YEAR', 'records lost',
       'ORGANISATION', 'METHOD OF LEAK', 'interesting story',
       'NO OF RECORDS STOLEN', 'DATA SENSITIVITY', 'UNUSED', 'UNUSED.1',
       'Exclude', 'Unnamed: 13', '1st source link', '2nd source link',
       '3rd source', 'source name'],
      dtype='object')

In [6]:
# Remove unnecessary columns

StripCols = DF[["Entity", "YEAR", "records lost", "ORGANISATION", "METHOD OF LEAK", "DATA SENSITIVITY"]]
StripCols.head()

Unnamed: 0,Entity,YEAR,records lost,ORGANISATION,METHOD OF LEAK,DATA SENSITIVITY
0,,"years are encoded (0=2004, 8 = 2012, 9 = 2013,...","(use 3m, 4m, 5m or 10m to approximate unknown ...",,,1. Just email address/Online information 20 SS...
1,AOL,0,92000000,web,inside job,1
2,Automatic Data Processing,1,125000,financial,poor security,20
3,Ameritrade Inc.,1,200000,financial,lost / stolen device,20
4,Citigroup,1,3900000,financial,lost / stolen device,300


In [7]:
# remove first row with NaN values

cleaned = StripCols.dropna()
cleaned.head()

Unnamed: 0,Entity,YEAR,records lost,ORGANISATION,METHOD OF LEAK,DATA SENSITIVITY
1,AOL,0,92000000,web,inside job,1
2,Automatic Data Processing,1,125000,financial,poor security,20
3,Ameritrade Inc.,1,200000,financial,lost / stolen device,20
4,Citigroup,1,3900000,financial,lost / stolen device,300
5,Cardsystems Solutions Inc.,1,40000000,financial,hacked,300


Convert year values to real years

In [8]:
cleaned2 = cleaned.copy()
cleaned2['YEAR'] = cleaned2['YEAR'].astype(int)   # Convert data type (astype returns a new Series)
cleaned2['Real Year'] = cleaned2['YEAR'] + 2004

cleaned3 = cleaned2[["Entity", "Real Year", "records lost", "ORGANISATION", "METHOD OF LEAK", "DATA SENSITIVITY"]]
cleaned3.tail()

Unnamed: 0,Entity,Real Year,records lost,ORGANISATION,METHOD OF LEAK,DATA SENSITIVITY
249,CEX,2018,2000000,retail,accidentally published,300
250,Swedish Transport Agency,2018,3000000,government,poor security,50000
251,Instagram,2018,6000000,web,hacked,1
252,Equifax,2018,143000000,financial,hacked,50000
253,Spambot,2018,711000000,web,poor security,4000


#### Some rows contain errors and need to be altered for consistency:

Most recent breaches appear as "2018". Change to "2017".

In [9]:
cleaned4 = cleaned3.copy()
cleaned4.loc[cleaned3['Real Year'] > 2017, 'Real Year'] = 2017
cleaned4.tail()

Unnamed: 0,Entity,Real Year,records lost,ORGANISATION,METHOD OF LEAK,DATA SENSITIVITY
249,CEX,2017,2000000,retail,accidentally published,300
250,Swedish Transport Agency,2017,3000000,government,poor security,50000
251,Instagram,2017,6000000,web,hacked,1
252,Equifax,2017,143000000,financial,hacked,50000
253,Spambot,2017,711000000,web,poor security,4000


One occurrence of "web, tech" should change to "tech, web" for consistency with the other "tech, web" rows.

In [10]:
cleaned5 = cleaned4.copy()
cleaned5.loc[cleaned4['ORGANISATION'] == 'web, tech', 'ORGANISATION'] = 'tech, web'

Change one occurrence of data sensitivity "3" to "300".

In [11]:
cleaned6 = cleaned5.copy()
cleaned6.loc[cleaned5['DATA SENSITIVITY'] == 3, 'DATA SENSITIVITY'] = 300

"Twitch.tv" is listed with organization type "healthcare". Change it to "gaming".

In [12]:
cleaned7 = cleaned6.copy()
cleaned7.loc[cleaned6['Entity'] == 'Twitch.tv', 'ORGANISATION'] = 'gaming'

In [13]:
cleaned7.loc[cleaned7['Entity'] == 'Twitch.tv']

Unnamed: 0,Entity,Real Year,records lost,ORGANISATION,METHOD OF LEAK,DATA SENSITIVITY
180,Twitch.tv,2014,10000000,gaming,hacked,1
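
The corrections above can also be collected into one function that works on a single copy of the frame, instead of a chain of `cleaned4` to `cleaned7` copies. This is only a sketch: `apply_corrections` and the small `sample` frame are hypothetical names with made-up rows, used to show the pattern rather than to reproduce the notebook's data.

```python
import pandas as pd


def apply_corrections(df):
    """Return a corrected copy of df; the input frame is left untouched."""
    out = df.copy()
    # Cap years past 2017 at 2017
    out.loc[out['Real Year'] > 2017, 'Real Year'] = 2017
    # Use one spelling for the combined category
    out.loc[out['ORGANISATION'] == 'web, tech', 'ORGANISATION'] = 'tech, web'
    # Fix the single sensitivity value recorded as 3
    out.loc[out['DATA SENSITIVITY'] == 3, 'DATA SENSITIVITY'] = 300
    # Twitch.tv is not a healthcare organisation
    out.loc[out['Entity'] == 'Twitch.tv', 'ORGANISATION'] = 'gaming'
    return out


# Made-up rows, only to exercise each rule
sample = pd.DataFrame({
    'Entity': ['Twitch.tv', 'Example A', 'Example B'],
    'Real Year': [2014, 2018, 2016],
    'ORGANISATION': ['healthcare', 'web, tech', 'retail'],
    'DATA SENSITIVITY': [1, 3, 20],
})
fixed = apply_corrections(sample)
print(fixed)
```

Keeping every rule in one place makes it easier to see which corrections were applied and to re-run them on a fresh load of the sheet.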


## Gather list of company stock ticker symbols:

Clean entity names for proper querying:

In [14]:
import re
from fuzzywuzzy import string_processing  # fuzzywuzzy used to manage ascii / unicode errors
from fuzzywuzzy import utils

# New column for ticker names
cleaned7['Stock Ticker'] = 'N/A'

# All Entity names to a list
names = cleaned7.Entity.unique().tolist()

# Clean names for a higher chance of a successful search query:
# strip quotes, replace commas, then let full_process replace characters
# that are not letters or digits with spaces, lowercase, and trim
names2 = []
for i in names:
    i = i.replace("\"", "")
    i = i.replace("\'", "")
    i = i.replace(",", " ")
    i = utils.full_process(i)
    names2.append(i)

# Dictionaries, keyed by position in the names list:
Original = {}       # Original names from raw data
RealNames = {}      # Names w/ special characters removed
WebNames = {}       # Names for entering into URL

# Replace spaces for the web search query. Note that '&nbsp;' is the HTML
# entity for a non-breaking space; the URL encoding of a space is '%20'
for i in range(0, len(names)):
    Original[i] = names[i]              # Need this to match back to DF
    RealNames[i] = names2[i]
    WebNames[i] = names2[i].replace(' ', '&nbsp;')
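
A note on the space replacement: inside a URL, `&` starts a new query parameter, so a name joined with `&nbsp;` is likely to be cut off at its first space when the server reads the `query` value. The standard library can do the encoding instead. The sketch below is an assumption about how this could be handled, not what the notebook ran; `name` is just an example value.

```python
from urllib.parse import quote, urlencode

name = 'automatic data processing'   # example of a cleaned entity name

# quote() percent-encodes characters that are not safe in a URL
encoded = quote(name)
print(encoded)            # automatic%20data%20processing

# urlencode() builds a whole query string from a dict of parameters
params = urlencode({'query': name, 'region': 1, 'lang': 'en'})
print(params)             # query=automatic+data+processing&region=1&lang=en
```

Either form keeps the full name inside the `query` parameter; `urlencode` also avoids assembling the parameter string by hand.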

Find ticker symbols of each company and put into new column

In [15]:
import urllib.request

for i in RealNames:
                            # Perform the search query
    webLink = 'http://d.yimg.com/autoc.finance.yahoo.com/autoc?query=' + WebNames[i] + '&region=1&lang=en'
    with urllib.request.urlopen(webLink) as response:
       html = response.read()
    
    Buis = str(html)        # Convert byte type to string
    
                            # Result if query returns nothing
    empty = len(RealNames[i]) + len('{"ResultSet":{"Query":"","Result":[]}}')

    Ticker = []             # List of tickers from search query

    if len(Buis) > empty:             # If search query finds any info,
        BuisInfo = Buis.split(',')    # Split string into array

        for entry in BuisInfo:
            if "symbol" in entry:     # Append ticker symbols to list
                symbol = entry.split("\"")     # extract ticker symbol
                Ticker.append(symbol[-2])
    if len(Ticker) > 0:
        result = min(Ticker, key=len)
        # Use .loc to avoid chained assignment, which may write to a temporary copy
        cleaned7.loc[cleaned7['Entity'] == Original[i], 'Stock Ticker'] = result
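
The search response is JSON, so parsing it with the `json` module is sturdier than splitting the string on commas and quotes. The sketch below assumes the `{"ResultSet": {"Query": ..., "Result": [...]}}` shape that the loop above checks for; the `payload` bytes, the `name` field, and the tickers in it are made up for illustration, and `shortest_symbol` is a hypothetical helper.

```python
import json

# Hypothetical response body, modeled on the structure the loop above expects
payload = (b'{"ResultSet":{"Query":"example","Result":['
           b'{"symbol":"ABC","name":"Example Corp"},'
           b'{"symbol":"ABC.XX","name":"Example Corp"}]}}')


def shortest_symbol(raw_bytes):
    """Return the shortest ticker symbol in the response, or None if there is none."""
    parsed = json.loads(raw_bytes.decode('utf-8'))
    results = parsed.get('ResultSet', {}).get('Result', [])
    symbols = [r['symbol'] for r in results if 'symbol' in r]
    return min(symbols, key=len) if symbols else None


print(shortest_symbol(payload))
print(shortest_symbol(b'{"ResultSet":{"Query":"zzz","Result":[]}}'))
```

This keeps the same "shortest symbol wins" rule as the loop, but no longer depends on where commas or quotes happen to fall in the raw text.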

Number of rows that did not get a stock ticker:

In [16]:
cleaned7['Entity'][cleaned7['Stock Ticker'] == 'N/A'].count()

66

We will only work with companies that have a stock ticker

In [165]:
cleaned8 = cleaned7[cleaned7['Stock Ticker'] != 'N/A'].copy()
cleaned8['Worst stock price drop rate after breach'] = 0.0

## Gathering stock prices for each company during year of data breach:

We do not know the exact date on which each breach was made public, only the year. So we assume the breach announcement coincides with the largest percentage drop in closing price over any 10-trading-day window within that year.
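
To make the assumption concrete, here is a minimal sketch on synthetic prices. The price series is made up, and the change is measured relative to the close 10 trading days earlier, which is an assumption of this sketch. `idxmin` then gives the date that would be treated as the announcement.

```python
import pandas as pd

# Synthetic closing prices: flat at 100, then a fall to 80 on the 16th trading day
dates = pd.bdate_range('2017-01-02', periods=30)
close = pd.Series([100.0] * 15 + [80.0] * 15, index=dates)

# Change over 10 trading days, relative to the close 10 days earlier
change_10d = close / close.shift(10) - 1

worst = change_10d.min()           # most negative 10-day change
worst_date = change_10d.idxmin()   # first date on which it occurs
print(worst, worst_date.date())
```

The first 10 entries of `change_10d` are NaN because there is no earlier close to compare against; `min` and `idxmin` skip them.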

In [87]:
import pandas_datareader.data as web
import datetime

In [166]:
# Row removal left gaps in the index, so reset it
cleaned8.reset_index(inplace=True)

In [167]:
cleaned8.columns

Index(['index', 'Entity', 'Real Year', 'records lost', 'ORGANISATION',
       'METHOD OF LEAK', 'DATA SENSITIVITY', 'Stock Ticker',
       'Worst stock price drop rate after breach'],
      dtype='object')

<B>MANUALLY</B> step through the rows with the code below to get the stock price drop rate for each row, from 0 to the length of the DF.  
The loop cannot be fully automated because the web datareader does not always work.  
If a row raises an error, re-running it may pull the data properly; if it still fails, there is probably no stock data for that ticker.  
A successful price printout means the value was entered.
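
One way to reduce the manual re-running is to wrap each download in a small retry helper. The sketch below is an assumption about how that could look, not part of the original workflow; `fetch_with_retries` is a hypothetical helper, and `flaky` is a stand-in for a download that fails intermittently.

```python
import time


def fetch_with_retries(fetch, attempts=3, delay=1.0):
    """Call fetch() until it succeeds or attempts run out; return None on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as err:       # broad on purpose: the failure type is unknown
            print(f'attempt {attempt} failed: {err}')
            if attempt < attempts:
                time.sleep(delay)
    return None


# Stand-in for a flaky download: fails twice, then succeeds
calls = {'count': 0}


def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise IOError('no data returned')
    return 'prices'


result = fetch_with_retries(flaky, attempts=3, delay=0)
print(result)
```

In the loop, the callable could be something like `lambda: web.DataReader(ticker, 'yahoo', start, end)`, with rows that return `None` skipped and reviewed afterwards.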

In [179]:
len(cleaned8)

185

In [328]:
n = 183  # Starting row; change this value manually to step through all rows

for i in range(n, len(cleaned8)):
    # Look up the fields needed for row i by column name
    entity = cleaned8.loc[i, 'Entity']
    year = int(cleaned8.loc[i, 'Real Year'])
    ticker = cleaned8.loc[i, 'Stock Ticker']

    start = datetime.datetime(year, 1, 1)
    end = datetime.datetime(year, 12, 31)
    f = web.DataReader(ticker, 'yahoo', start, end)

    day1 = f['Close']            # We will only look at close values
    day2 = day1.shift(+10)       # Close from 10 trading days earlier

    ChangeAmnt = day1 - day2     # Negative is drop; positive is rise
    # Note: the change is divided by the later close (day1), not by the
    # earlier close (day2) used in the conventional percent-change definition
    ChangePercent = ChangeAmnt / day1

    sortPercents = ChangePercent.sort_values()   # Ascending; NaN values go last
    # Enter the worst 10-day change into the DF
    cleaned8.loc[i, 'Worst stock price drop rate after breach'] = sortPercents.iloc[0]
    print(i, sortPercents.iloc[0])

183 -0.0358219733065
184 -0.532264964543
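
The loop above divides each 10-day change by the later close (`day1`). The more common definition of percent change divides by the earlier close (`day2`), which gives a smaller magnitude for the same drop. The two are related by simple algebra, so the printed values can be converted without downloading prices again. The helper name `to_earlier_close_basis` is hypothetical, and this is a sketch of the arithmetic only.

```python
def to_earlier_close_basis(r_later):
    """Convert (p1 - p0) / p1 into (p1 - p0) / p0."""
    # (p1 - p0) / p1 = r_later  implies  p0 / p1 = 1 - r_later,
    # so (p1 - p0) / p0 = r_later / (1 - r_later)
    return r_later / (1.0 - r_later)


# The two values printed by the loop above
for value in (-0.0358219733065, -0.532264964543):
    print(value, to_earlier_close_basis(value))
```

For small changes the two measures are close; for large drops they diverge noticeably, which matters when comparing the worst cases across companies.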


Check data entry accuracy below:

In [329]:
cleaned8.loc[184]

index                                             252
Entity                                        Equifax
Real Year                                        2017
records lost                                143000000
ORGANISATION                                financial
METHOD OF LEAK                                 hacked
DATA SENSITIVITY                                50000
Stock Ticker                                      EFX
Worst stock price drop rate after breach    -0.532265
Name: 184, dtype: object

In [330]:
# Export to CSV

cleaned8.to_csv("DataBreach&Stocks.csv", header=True, encoding='utf-8')

Remove the leftover rows that contain no stock data, i.e. rows whose drop rate is still the 0.0 placeholder:

In [3]:
DF = pd.read_csv("DataBreach&Stocks.csv")

In [5]:
DF2 = DF[DF['Worst stock price drop rate after breach'] != 0.000000]

In [8]:
DF2.to_csv("DataBreach&Stocks2.csv", header=True, encoding='utf-8')
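
Filtering on `0.0` works here because that value was used as the placeholder, but it would also discard a row whose measured value happened to be exactly zero. A possible alternative, shown as a sketch with made-up rows, is to initialize the column with NaN and drop the rows that are still missing.

```python
import numpy as np
import pandas as pd

col = 'Worst stock price drop rate after breach'

# Made-up rows: the second one never received stock data
df = pd.DataFrame({'Entity': ['Example A', 'Example B', 'Example C'],
                   col: [-0.12, np.nan, 0.0]})

# Keep only rows where a value was actually entered
with_data = df.dropna(subset=[col])
print(with_data)
```

With this approach "no data" and "zero change" stay distinguishable in the exported CSV.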

# Conclusion

Stock price data can be very detailed. I decided that the most important measure is the change in closing price over a 10-trading-day window during the year in which the company experienced a breach. The largest drop over any such window is assumed to mark when the breach was announced or occurred. I then averaged these drop rates by organization type, on the assumption that viewers will find it most useful to see how their own industry performs relative to others in the stock market.

Any company's stock can drop in value at various points in time, and many factors determine how severe a drop is and why it happens. With this graph, viewers can see which industries' stock prices drop harder than others, and possibly also understand why.
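
The averaging step was done for the visualization and is not shown in the notebook. In pandas it could look like the sketch below; the rows are made up and stand in for the exported CSV, so the numbers mean nothing beyond illustrating the grouping.

```python
import pandas as pd

col = 'Worst stock price drop rate after breach'

# Made-up rows standing in for the exported CSV
df = pd.DataFrame({
    'ORGANISATION': ['financial', 'financial', 'web', 'retail', 'web'],
    col: [-0.30, -0.10, -0.20, -0.05, -0.40],
})

# Mean drop per organization type, most negative first
avg_drop = df.groupby('ORGANISATION')[col].mean().sort_values()
print(avg_drop)
```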