# Claim to support / refute:

#### - “Security breaches do mostly occur in less-tech-savvy organizations”

https://public.tableau.com/profile/melvin7659#!/vizhome/Mel_Lab2_0/ofIncidentsbySavinessOrganization

## Organizing the Data

In [1]:
# Imports

import pandas as pd
import xlrd as xl

In [2]:
# Load the Excel workbook
data = pd.ExcelFile("Information is Beautiful- Data Breaches (public).xlsx")

In [3]:
# List the sheet names in the Excel workbook
data.sheet_names

[u'2017 update', u'Jan 2015 update', u'July 2013 update (old)']

In [4]:
# Load first sheet into dataframe

DF = data.parse('2017 update')

In [5]:
# Preview data columns
DF.columns

Index([              u'Entity',     u'alternative name',
                      u'story',                 u'YEAR',
               u'records lost',         u'ORGANISATION',
             u'METHOD OF LEAK',    u'interesting story',
       u'NO OF RECORDS STOLEN',     u'DATA SENSITIVITY',
                     u'UNUSED',             u'UNUSED.1',
                    u'Exclude',          u'Unnamed: 13',
            u'1st source link',      u'2nd source link',
                 u'3rd source',          u'source name'],
      dtype='object')

In [6]:
# Remove unnecessary columns

StripCols = DF[["Entity", "YEAR", "records lost", "ORGANISATION", "METHOD OF LEAK", "DATA SENSITIVITY"]]
StripCols.head()

Unnamed: 0,Entity,YEAR,records lost,ORGANISATION,METHOD OF LEAK,DATA SENSITIVITY
0,,"years are encoded (0=2004, 8 = 2012, 9 = 2013,...","(use 3m, 4m, 5m or 10m to approximate unknown ...",,,1. Just email address/Online information 20 SS...
1,AOL,0,92000000,web,inside job,1
2,Automatic Data Processing,1,125000,financial,poor security,20
3,Ameritrade Inc.,1,200000,financial,lost / stolen device,20
4,Citigroup,1,3900000,financial,lost / stolen device,300


In [7]:
# Drop rows containing NaN values, including the descriptive first row

cleaned = StripCols.dropna()
cleaned.head()

Unnamed: 0,Entity,YEAR,records lost,ORGANISATION,METHOD OF LEAK,DATA SENSITIVITY
1,AOL,0,92000000,web,inside job,1
2,Automatic Data Processing,1,125000,financial,poor security,20
3,Ameritrade Inc.,1,200000,financial,lost / stolen device,20
4,Citigroup,1,3900000,financial,lost / stolen device,300
5,Cardsystems Solutions Inc.,1,40000000,financial,hacked,300


Convert year values to real years

In [8]:
cleaned2 = cleaned.copy()
# astype() returns a new Series, so the cast must be used in the assignment (0 = 2004)
cleaned2['Real Year'] = cleaned2['YEAR'].astype(int) + 2004

cleaned3 = cleaned2[["Entity", "Real Year", "records lost", "ORGANISATION", "METHOD OF LEAK", "DATA SENSITIVITY"]]
cleaned3.tail()

Unnamed: 0,Entity,Real Year,records lost,ORGANISATION,METHOD OF LEAK,DATA SENSITIVITY
249,CEX,2018,2000000,retail,accidentally published,300
250,Swedish Transport Agency,2018,3000000,government,poor security,50000
251,Instagram,2018,6000000,web,hacked,1
252,Equifax,2018,143000000,financial,hacked,50000
253,Spambot,2018,711000000,web,poor security,4000
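
The year decoding can be sketched on a small stand-alone frame. This is only an illustration: the two rows reuse the `Entity` and `YEAR` values shown in the preview above, and the object dtype is an assumption about how the column is read when the sheet's first row holds a text note.

```python
import pandas as pd

# Two rows taken from the preview above; YEAR is stored as object dtype
# to mimic a column that also contained a text description row.
sample = pd.DataFrame({
    "Entity": ["AOL", "Citigroup"],
    "YEAR": pd.Series([0, 1], dtype=object),
})

# Cast the year codes to integers, then add the 2004 offset (0 = 2004).
sample["Real Year"] = sample["YEAR"].astype(int) + 2004

print(sample[["Entity", "Real Year"]])
```

The cast matters because `astype` does not modify the column in place; its result has to be assigned or used directly.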


### Some rows contained errors and needed altering for consistency:

Most recent breaches appear as "2018". Change to "2017".

In [9]:
cleaned4 = cleaned3.copy()
cleaned4.loc[cleaned3['Real Year'] > 2017, 'Real Year'] = 2017
cleaned4.tail()

Unnamed: 0,Entity,Real Year,records lost,ORGANISATION,METHOD OF LEAK,DATA SENSITIVITY
249,CEX,2017,2000000,retail,accidentally published,300
250,Swedish Transport Agency,2017,3000000,government,poor security,50000
251,Instagram,2017,6000000,web,hacked,1
252,Equifax,2017,143000000,financial,hacked,50000
253,Spambot,2017,711000000,web,poor security,4000


One occurrence of "web, tech" should change to "tech, web" for consistency with the other "tech, web" rows.

In [10]:
cleaned5 = cleaned4.copy()
cleaned5.loc[cleaned4['ORGANISATION'] == 'web, tech', 'ORGANISATION'] = 'tech, web'

Change one occurrence of data sensitivity "3" to "300".

In [11]:
cleaned6 = cleaned5.copy()
cleaned6.loc[cleaned5['DATA SENSITIVITY'] == 3, 'DATA SENSITIVITY'] = 300

The "Twitch.tv" organization is listed as "healthcare". Change it to "gaming".

In [12]:
cleaned7 = cleaned6.copy()
cleaned7.loc[cleaned6['Entity'] == 'Twitch.tv', 'ORGANISATION'] = 'gaming'

In [13]:
cleaned7.loc[cleaned7['Entity'] == 'Twitch.tv']

Unnamed: 0,Entity,Real Year,records lost,ORGANISATION,METHOD OF LEAK,DATA SENSITIVITY
180,Twitch.tv,2014,10000000,gaming,hacked,1
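
One way to confirm that corrections like these took effect is to assert on the result. The sketch below applies the same four `.loc` fixes to a toy frame; the Twitch.tv and Instagram rows reuse values shown in this notebook, while "Example Corp" is a hypothetical row added only to exercise the other two fixes.

```python
import pandas as pd

toy = pd.DataFrame({
    "Entity": ["Twitch.tv", "Instagram", "Example Corp"],  # "Example Corp" is hypothetical
    "Real Year": [2014, 2018, 2016],
    "ORGANISATION": ["healthcare", "web", "web, tech"],
    "DATA SENSITIVITY": [1, 1, 3],
})

fixed = toy.copy()
# Boolean-mask assignment: select the rows with a condition, the column by name.
fixed.loc[fixed["Real Year"] > 2017, "Real Year"] = 2017
fixed.loc[fixed["ORGANISATION"] == "web, tech", "ORGANISATION"] = "tech, web"
fixed.loc[fixed["DATA SENSITIVITY"] == 3, "DATA SENSITIVITY"] = 300
fixed.loc[fixed["Entity"] == "Twitch.tv", "ORGANISATION"] = "gaming"

# Sanity checks: none of the problem values should remain.
assert fixed["Real Year"].max() <= 2017
assert "web, tech" not in set(fixed["ORGANISATION"])
assert 3 not in set(fixed["DATA SENSITIVITY"])
```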


### Each organization type is assigned to one of two tech-savviness categories:

In [14]:
# show all types of organizations listed:

cleaned7.ORGANISATION.unique()

array([u'web', u'financial', u'tech, retail', u'telecoms',
       u'government, military', u'government', u'retail', u'academic',
       u'energy', u'military', u'healthcare', u'tech',
       u'government, healthcare', u'web, gaming', u'gaming', u'media',
       u'military, healthcare', u'web, military', u'tech, web',
       u'transport', u'legal', u'app'], dtype=object)

### My definition of a tech-savvy company:

#### - Nowadays, most companies use technology in their operations.  A company that simply uses technology is not tech-savvy.  A tech-savvy company is one that is involved in improving the technologies that all companies and consumers use, whether in features or security.  If a company has multiple organisation categories, it is savvy as long as one of those categories is tech related.
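
The "savvy if any one category is tech related" rule can be sketched as a small function. This is an illustration under assumptions, not the method used in this notebook: the `TECH_RELATED` set is inferred from the labels this notebook assigns to each organisation type, and the function name `tech_saviness` is hypothetical.

```python
# Base categories treated as tech related. This set is an assumption,
# inferred from the "Savy" labels assigned in this notebook.
TECH_RELATED = {
    "web", "financial", "telecoms", "tech",
    "gaming", "app", "government", "military",
}


def tech_saviness(organisation: str) -> str:
    """Return "Savy" if any comma-separated category is tech related."""
    parts = [part.strip() for part in organisation.split(",")]
    return "Savy" if any(part in TECH_RELATED for part in parts) else "NonSavy"


examples = ["tech, retail", "military, healthcare", "healthcare", "retail"]
labels = {org: tech_saviness(org) for org in examples}
print(labels)
```

Splitting on the comma means a combined type such as "tech, retail" needs no entry of its own; it is labelled from its parts.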

In [15]:
# Make new column "Tech Saviness", categorize accordingly.
# .loc with isin() replaces chained assignment, which can fail to update the frame.
savvy_orgs = [
    "web", "financial", "telecoms", "tech", "web, gaming", "gaming",
    "web, military", "tech, web", "app", "tech, retail",
    "government, military", "government", "government, healthcare",
    "military", "military, healthcare",
]
non_savvy_orgs = [
    "retail", "academic", "energy", "healthcare", "media", "transport", "legal",
]

cleaned7["Tech Saviness"] = "New"   # placeholder for any type not listed above
cleaned7.loc[cleaned7["ORGANISATION"].isin(savvy_orgs), "Tech Saviness"] = "Savy"
cleaned7.loc[cleaned7["ORGANISATION"].isin(non_savvy_orgs), "Tech Saviness"] = "NonSavy"
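
Because the column starts out as the placeholder "New", any organisation type that was missed keeps that value. A minimal sketch of a check for leftovers, run here on a toy frame in which "charity" is a hypothetical type that does not appear in the dataset:

```python
import pandas as pd

toy = pd.DataFrame({"ORGANISATION": ["web", "retail", "charity"]})  # "charity" is hypothetical

# Same pattern as the cell above: placeholder first, then label by type.
toy["Tech Saviness"] = "New"
toy.loc[toy["ORGANISATION"] == "web", "Tech Saviness"] = "Savy"
toy.loc[toy["ORGANISATION"] == "retail", "Tech Saviness"] = "NonSavy"

# Any type still carrying the placeholder was never categorized.
unlabelled = toy.loc[toy["Tech Saviness"] == "New", "ORGANISATION"].unique().tolist()
print(unlabelled)
```

An empty list from the same check on the real frame would indicate that every organisation type received a label.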

In [16]:
# Export to CSV

cleaned7.to_csv("DataBreaches.csv", header=True, encoding='utf-8')

# Conclusion

In this dataset, organizations classified as tech-savvy account for more breach incidents than those classified as non-savvy, which refutes the claim that security breaches mostly occur in less-tech-savvy organizations.
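
The comparison behind this conclusion is a count of incidents per category. A minimal sketch with pandas follows, assuming a frame with the `Tech Saviness` column built earlier. The five rows are only the last five entries previewed in this notebook, labelled according to the categorization above, so the numbers they produce illustrate the method and are not the result for the full dataset.

```python
import pandas as pd

# Last five entries previewed earlier, with labels from the categorization:
# retail -> NonSavy; government, web, financial -> Savy.
sample = pd.DataFrame({
    "Entity": ["CEX", "Swedish Transport Agency", "Instagram", "Equifax", "Spambot"],
    "Tech Saviness": ["NonSavy", "Savy", "Savy", "Savy", "Savy"],
})

# Number of incidents per category, and each category's share of incidents.
counts = sample["Tech Saviness"].value_counts()
shares = sample["Tech Saviness"].value_counts(normalize=True)

print(counts)
print(shares)
```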

I chose to disregard the number of records stolen per incident because I believe it is irrelevant to the claim. Once an attacker gains access to sensitive information, they already have access to all the information at that level of security.  The number of records stored differs from company to company, and a company's number of customers has nothing to do with its level of security or the value of its data.  Savviness and organization type are therefore the more important factors to consider.