<a id="top"></a>
# Draw insights from car accident reports
This notebook shows you how to analyze motor vehicle accidents based on accident reports for New York. The analysis steps in the notebook show how you can use the information about accidents to learn more about the possible causes of collisions. You will learn how to install additional Python packages, how to add external PySpark modules, and how to perform descriptive data analysis.

This notebook runs on Python 2 with Spark 2.0.2.

## Load data

This data set covers all reported vehicle collisions in New York starting in July 2012 until the end of December 2017 and contains detailed information about the incidents.

The file is already part of the Tutorial project. Accessing the data is as simple as reading the `csv` file into a DataFrame.

In [None]:
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

collisions = spark.read.csv('../datasets/NYPD_Motor_Vehicle_Collisions.csv', header='true', inferSchema='true') 
collisions.take(2)

### Caching the data
Here we use the `cache()` method to tell Spark that the data will be reused. If the data can be cached, Spark avoids the cost of re-reading it from storage.

In [None]:
collisions.cache()

In [None]:
# Print the number of records and display the DataFrame schema
print("Records: {}".format(collisions.count()))
collisions.printSchema()

## Load visualization packages

To plot data, this notebook will use the following two packages, which you need to import or install:

- [Matplotlib](http://matplotlib.org/), a basic plotting library for Python
- [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/), a statistical data visualization library

The `seaborn` package is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

The import commands are there to make these packages available.

The Pandas package was mentioned in the first tutorial. It is a Python library used for data analysis.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt
# matplotlib.patches lets us create colored patches, which we can use for legends in plots
import matplotlib.patches as mpatches
# seaborn (imported above) builds on matplotlib and adds graphical features and new plot types

# Adjust settings
# The inline magic ensures that plots are shown in the cell output; see the IPython documentation for more information
%matplotlib inline
sns.set_style("white")
plt.rcParams['figure.figsize'] = (15, 15)

### Extracting the desired information
The following cell extracts a specific set of attributes from the records whose latitude is not 0 into a Pandas DataFrame.

A Pandas DataFrame is a local structure. This means that all the data extracted from the Spark DataFrame must fit in local memory.

The cell also changes some column names and data types (from double to float).<br/>
Finally, additional Pandas DataFrames are created based on whether people were killed or injured.

In [None]:
collisions_df = collisions
collisions_pd = collisions_df[collisions_df['LATITUDE'] != 0][['LATITUDE', 'LONGITUDE', 'DATE', 'TIME',
                                                               'BOROUGH', 'ON STREET NAME', 'CROSS STREET NAME',
                                                               'NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED',
                                                               'CONTRIBUTING FACTOR VEHICLE 1']].toPandas()

collisions_pd.columns = ['Latitude', 'Longitude', 'Date', 'Time', 'Borough', 'On Street',
                         'Cross Street', 'Persons Injured', 'Persons Killed', 'Contributing Factor']

collisions_pd['Latitude'] = collisions_pd['Latitude'].astype(float)
collisions_pd['Longitude'] = collisions_pd['Longitude'].astype(float)
collisions_pd['Persons Killed'] = collisions_pd['Persons Killed'].astype(float)
collisions_pd['Persons Injured'] = collisions_pd['Persons Injured'].astype(float)



#divide dataset into accident categories: fatal, non-fatal but with injuries, none of the above
killed_pd = collisions_pd[collisions_pd['Persons Killed']!=0]
injured_pd = collisions_pd[np.logical_and(collisions_pd['Persons Injured']!=0, collisions_pd['Persons Killed']==0)]
nothing_pd = collisions_pd[np.logical_and(collisions_pd['Persons Killed']==0, collisions_pd['Persons Injured']==0)]
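
The three-way split above assigns every record to exactly one category. The following plain-Python sketch is not part of the original notebook; it only illustrates the same rule on a few made-up records. The sample values and the helper name `severity` are hypothetical.

```python
# Hypothetical sample records: (persons_injured, persons_killed)
sample = [(0, 0), (2, 0), (1, 1), (0, 1), (0, 0)]

def severity(injured, killed):
    """Classify one collision with the same rule used for the three DataFrames."""
    if killed != 0:
        return 'killed'    # fatal, regardless of injuries
    if injured != 0:
        return 'injured'   # injuries but no fatalities
    return 'nothing'       # neither injuries nor fatalities

labels = [severity(i, k) for i, k in sample]
print(labels)
```

Note that a collision with both injuries and fatalities counts as fatal only, which matches the filters used for `killed_pd` and `injured_pd`.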

### Create an explorative scatter plot of the data
Using an explorative scatter plot is a way to analyze certain characteristics of the data set. 

Create an initial explorative scatter plot of all collisions by using the latitude and longitude information in the raw data:

In [None]:
#create scatterplots
plt.figure(figsize=(15,10))
plt.scatter(collisions_pd.Longitude, collisions_pd.Latitude, alpha=0.05, s=4, color='darkseagreen')

#adjust more settings
plt.title('Motor Vehicle Collisions in New York City', size=25)
plt.xlim((-74.26,-73.7))
plt.ylim((40.5,40.92))
plt.xlabel('Longitude',size=20)
plt.ylabel('Latitude',size=20)

plt.show()

Although this is not a real street map of New York City, the scatter plot dots roughly correspond to the street grid. You see very few collisions in Central Park or on bridges, as opposed to street crossings and curves, where there is a noticeably higher density of collisions.

### Enhance the scatter plot with information about city boroughs
Now add information about the city boroughs and use a different color to depict each borough on the scatter plot:


In [None]:
manhattan = collisions_pd[collisions_pd['Borough']=='MANHATTAN']
bronx = collisions_pd[collisions_pd['Borough']=='BRONX']
brooklyn = collisions_pd[collisions_pd['Borough']=='BROOKLYN']
staten = collisions_pd[collisions_pd['Borough']=='STATEN ISLAND']
queens = collisions_pd[collisions_pd['Borough']=='QUEENS']


#create scatterplots
plt.figure(figsize=(15,10))
plt.scatter(manhattan.Longitude, manhattan.Latitude, s=1, color='blue', marker ='.')
plt.scatter(bronx.Longitude, bronx.Latitude, s=1, color='yellow', marker ='.')
plt.scatter(brooklyn.Longitude, brooklyn.Latitude, color='red', s=1, marker ='.')
plt.scatter(staten.Longitude, staten.Latitude, s=1, color='green', marker ='.')
plt.scatter(queens.Longitude, queens.Latitude, s=1, color='black', marker ='.')

#create legend
blue_patch = mpatches.Patch(label='Manhattan', color='blue')
yellow_patch = mpatches.Patch(color='yellow', label='Bronx')
red_patch = mpatches.Patch(color='red', label='Brooklyn')
green_patch = mpatches.Patch(color='green', label='Staten Island')
black_patch = mpatches.Patch(color='black', label='Queens')
plt.legend([blue_patch, yellow_patch, red_patch, green_patch, black_patch],
           ('Manhattan', 'Bronx', 'Brooklyn', 'Staten Island', 'Queens'), 
           loc='upper left', prop={'size':20})

#adjust more settings
plt.title('Motor Vehicle Collisions in New York City by borough', size=20)
plt.xlim((-74.26,-73.7))
plt.ylim((40.5,40.92))
plt.xlabel('Longitude',size=20)
plt.ylabel('Latitude',size=20)
plt.show()

#### Which boroughs have the highest total number of crashes?

In [None]:
plt.figure(figsize=(8,5))
borough = collisions_df.groupBy('BOROUGH').count().sort('count').toPandas() # .iloc[1:,:]
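# Label records without a borough as 'NONE' (assign the result instead of relying on inplace on a column selection)
borough['BOROUGH'] = borough['BOROUGH'].fillna('NONE')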
colors = ['g','0.75','y','k','b','r']
borough.sort_values(by='count', ascending=True)['count'].plot.barh(color=colors)
plt.xlabel('Collisions')
plt.ylabel('Borough')
plt.title('Total Number of Collisions by Borough', size=15)
plt.yticks(range(0,6),borough['BOROUGH'])
plt.tight_layout()
plt.show()

The bar graph clearly shows that the most collisions happen in Brooklyn and the least on Staten Island.

In [None]:
# Display the DataFrame that holds the number of accidents per borough (NONE indicates that the borough was blank in the record)
borough
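
The counting step behind the bar graph can be sketched without Spark. This is a minimal illustration on made-up values, not the notebook's own computation; the sample list is hypothetical and `None` stands for a blank borough field.

```python
from collections import Counter

# Hypothetical BOROUGH values; None stands for a blank borough field
boroughs = ['BROOKLYN', None, 'QUEENS', 'BROOKLYN', 'BRONX', None, 'BROOKLYN']

# Replace missing values with the label 'NONE', then count per borough
counts = Counter('NONE' if b is None else b for b in boroughs)

# Sort ascending by count (ties broken by name), the order used for a horizontal bar chart
ordered = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
print(ordered)
```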

### Enhance the scatter plot to identify accident severity
We use the Pandas DataFrames created earlier to plot each severity level in a different color:

In [None]:
#adjust settings
plt.figure(figsize=(15,10))

#create scatterplots
plt.scatter(nothing_pd.Longitude, nothing_pd.Latitude, alpha=0.04, s=1, color='blue')
plt.scatter(injured_pd.Longitude, injured_pd.Latitude, alpha=0.1, s=1, color='yellow')
plt.scatter(killed_pd.Longitude, killed_pd.Latitude, color='red', s=5)

#create legend
blue_patch = mpatches.Patch( label='car body damage', alpha=0.2, color='blue')
yellow_patch = mpatches.Patch(color='yellow', label='personal injury', alpha=0.5)
red_patch = mpatches.Patch(color='red', label='fatal accidents')
plt.legend([blue_patch, yellow_patch, red_patch],('car body damage', 'personal injury', 'fatal accidents'), 
           loc='upper left', prop={'size':20})

#adjust more settings
plt.title('Severity of Motor Vehicle Collisions in New York City', size=20)
plt.xlim((-74.26,-73.7))
plt.ylim((40.5,40.92))
plt.xlabel('Longitude',size=20)
plt.ylabel('Latitude',size=20)
plt.savefig('anothertry.png')

plt.show()

The resulting scatter plot shows that there are fatal accident hot spots throughout the city. You can see that in some areas car body damage is prevalent, while in other areas personal injuries happen more often.

## Clean and shape the data
After using scatter plots to analyze certain characteristics of the raw data set, you will now learn how to clean and shape the data set to enable more plotting and further analysis. 

Begin by looking at the column names again to better assess which information you can use:

In [None]:
collisions_header_list = collisions.columns[:-4] # all columns except the last 4 (see printSchema above)
# Remove a few additional columns from the list
collisions_header_list.remove("CONTRIBUTING FACTOR VEHICLE 3")
collisions_header_list.remove("CONTRIBUTING FACTOR VEHICLE 4")
collisions_header_list.remove("CONTRIBUTING FACTOR VEHICLE 5")
# Take only records that include the "ON STREET NAME" and "BOROUGH" and return only the desired attributes
collisions_df = collisions_df.dropna(how='any', subset=['ON STREET NAME', 'BOROUGH'])[collisions_header_list]
# Display statistics on the resulting DataFrame, especially the number of non-null values in each column
collisions_df.toPandas().info()

### Spatial and temporal normalization by using Spark

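To obtain a consistent representation of the spatial and temporal information about collisions, you have to normalize the data. Here, normalization means bringing values that refer to the same thing into one canonical form: street names are lowercased, stripped of non-alphanumeric characters, and abbreviated consistently, and the date and time columns are combined into a single timestamp truncated to the hour. This step will help you in future analyses.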

In [None]:
delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())
deltable = dict((ord(char), None) for char in delchars) # Python2 unicode
normalization_code = {
    'avenue':'av',
    'ave':'av',
    'avnue': 'av',
    'street': 'st',
    'road': 'rd',
    'boulevard': 'blvd',
    'place': 'pl',
    'plaza': 'pl',
    'square': 'sq',
    'drive': 'dr',
    'lane': 'ln',
    'parkway': 'pkwy',
    'turnpike': 'tp',
    'terrace': 'ter',
    '1st': '1',
    '2nd':'2',
    '3rd': '3',
    '1th': '1',
    '2th': '2',
    '3th': '3',
    '4th': '4',
    '5th': '5',
    '6th': '6',
    '7th': '7', 
    '8th': '8',
    '9th': '9',
    '0th': '0',
    'west ': 'w ',
    'north ': 'n ',
    'east ': 'e ',
    'south ': 's ',
}
def normalize_street(s):
    # Lowercase
    s = s.lower()

    # Delete all non-alphanumeric characters
    if isinstance(s, unicode):
        s = s.translate(deltable)
    else:
        s = s.translate(None, delchars) # Python 2

    # Replace common abbreviations
    for k in sorted(normalization_code.keys()):
        s = s.replace(k, normalization_code[k])

    # Only keep ascii chars
    s = s.encode('ascii', errors='ignore').decode()

    return s

def row_parser(row):
    """
    Spatial and temporal normalization.
    Returns a list with the timestamp (date and hour), the normalized street name,
    the lowercased borough, and the remaining columns.
    """
    from datetime import datetime

    # create a row dictionary
    row_dict = row.asDict()
    
    # temporal
    ## date
    temp = row_dict['DATE']
    hr = row_dict['TIME'].split(":")[0]
    try:
        a = datetime.strptime(temp+" "+hr, '%m/%d/%Y %H')
        dates =  [a]
    except (TypeError, ValueError):
        # Fall back to the current time if the date cannot be parsed
        a = datetime.now()
        dates = [a]
    
    # location and borough
    location = normalize_street(row_dict['ON STREET NAME'])
    borough = row_dict['BOROUGH'].lower()
    
    
    # other cols
    others = [row_dict[column] for column in collisions_header_list
             if column not in ["ON STREET NAME", "OFF STREET NAME", "CROSS STREET NAME", "BOROUGH", "DATE", "TIME"]]

       

    # return everything together
    return dates + [location] + [borough] + others
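
The street normalization above relies on Python 2 string handling. The following sketch shows the same idea (lowercase, keep letters and digits, replace abbreviations in sorted key order) in a form that also runs on Python 3. It is an illustration under assumptions, not the notebook's function: the name `normalize_street_sketch` is hypothetical and only a subset of the abbreviation table is used.

```python
import re

# Subset of the notebook's abbreviation table, enough for the examples below
ABBREVIATIONS = {
    'avenue': 'av', 'ave': 'av', 'avnue': 'av',
    'street': 'st', 'road': 'rd', 'boulevard': 'blvd',
    '1st': '1', '2nd': '2', '3rd': '3',
}

def normalize_street_sketch(name):
    """Lowercase, keep ASCII letters and digits only, then abbreviate."""
    s = re.sub(r'[^0-9a-z]', '', name.lower())
    # Keys are applied in sorted order: 'ave' runs before 'avenue' and turns
    # 'avenue' into 'avnue'; the 'avnue' entry then completes the replacement.
    for key in sorted(ABBREVIATIONS):
        s = s.replace(key, ABBREVIATIONS[key])
    return s

for raw in ['Atlantic Avenue', 'ATLANTIC AVE.', '2nd Avenue', 'Broadway']:
    print('{} -> {}'.format(raw, normalize_street_sketch(raw)))
```

Because the replacement works on substrings, `road` is also replaced inside `broadway`, which yields `brdway`. This is why the notebook later looks up Broadway under the name `brdway`.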


#### Now apply the parser to every record.

In [None]:
collisions_out_row = Row(*(["Time", "Street", "Borough"] + [c for c in collisions_header_list
                      if c not in ["ON STREET NAME", "OFF STREET NAME", "CROSS STREET NAME", "BOROUGH", "DATE", "TIME"]]))
collisions_out_index = list(collisions_out_row)

collisions_out = collisions_df.rdd.map(
    lambda row: collisions_out_row(*(row_parser(row)))).toDF()

In [None]:
# Statistics on our resulting DataFrame
collisions_out.toPandas().info()
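
The `Time` column produced by `row_parser` keeps the date and the hour but drops the minutes, so all collisions in the same hour share one timestamp. A minimal sketch of that step, assuming the `MM/DD/YYYY` date format and `H:MM` time format that the parser expects (the helper name `parse_hour_sketch` and the sample values are hypothetical):

```python
from datetime import datetime

def parse_hour_sketch(date_str, time_str):
    """Combine DATE (MM/DD/YYYY) with the hour part of TIME (H:MM)."""
    hour = time_str.split(':')[0]
    return datetime.strptime(date_str + ' ' + hour, '%m/%d/%Y %H')

ts = parse_hour_sketch('07/01/2012', '9:45')
same_hour = parse_hour_sketch('07/01/2012', '9:05')
print(ts)
print(ts == same_hour)
```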

### Investigating data attributes
You can draw information from your data by examining the attributes in the data set and finding out how useful they are. 

Begin by plotting the contributing factors of an accident:

In [None]:
from pyspark.sql.functions import desc

collisions_out_df = collisions_out

factor = collisions_out_df.groupBy('CONTRIBUTING FACTOR VEHICLE 1').count().sort(desc('count')).toPandas()
factor = factor[0:20].sort_index(ascending=False)
factor.plot(kind='barh', legend=False, color='blue', figsize=(14,10))
plt.title('Composition of: ' + 'CONTRIBUTING FACTOR VEHICLE 1', size=20)
plt.xlabel('Count')
plt.yticks(range(len(factor))[::-1], factor['CONTRIBUTING FACTOR VEHICLE 1'][::-1])
plt.show()

Running the code cell above shows you that the contributing factor can't be specified in most cases. However, factors like distraction, failure to yield right-of-way, and fatigue could have an influence.

### Sorting accidents by vehicle type
The data set has entries for a large number of car types. To avoid inconclusive results because the number of car types is too large, regroup the car types into main categories like auto, bus, truck, taxi, or other:

In [None]:
from collections import Counter

vehicletypecode, vehicletypecoderange = 'VEHICLE TYPE CODE ', range(1,6)
grouping = {
    'TAXI': 'Taxi',
    'AMBULANCE': 'Other',
    'BICYCLE': 'Other',
    'BUS': 'Bus',
    'FIRE TRUCK': 'Other', 
    'LARGE COM VEH(6 OR MORE TIRES)': 'Truck',
    'LIVERY VEHICLE': 'Truck',
    'MOTORCYCLE': 'Other', 
    'OTHER': 'Other',
    'PASSENGER VEHICLE': 'Auto',
    'PICK-UP TRUCK': 'Other',
    'PEDICAB': 'Other', 
    'SCOOTER': 'Other',
    'SMALL COM VEH(4 TIRES) ': 'Truck',
    'SPORT UTILITY / STATION WAGON': 'Auto', 
    'UNKNOWN': 'Other',
    'VAN': 'Auto',
    'UNSPECIFIED': 'Other',
    None: None
}
# Over time, additional vehicle type codes have been used. The following function makes sure that any unknown code falls under "Other"
def vtype_group(vtype):
    # dict.get returns the 'Other' fallback for codes that are missing from the grouping table
    return grouping.get(vtype, 'Other')

collisions_out_categories = collisions_out.rdd.map(lambda row:
                   collisions_out_row(*[vtype_group(row[i]) if collisions_out_index[i].startswith("VEHICLE TYPE CODE")
                                                    else row[i] for i in range(len(row))])
                 ).toDF()
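
The fallback behavior of the grouping step can be tried on its own. This sketch uses a subset of the grouping table; the name `vtype_group_sketch` and the code `E-SCOOTER` are hypothetical and only serve to show what happens with a code that is not in the table.

```python
# Subset of the notebook's grouping table
grouping_subset = {
    'TAXI': 'Taxi',
    'BUS': 'Bus',
    'PASSENGER VEHICLE': 'Auto',
    None: None,   # an empty vehicle slot stays empty
}

def vtype_group_sketch(vtype):
    # dict.get returns the fallback only when the key is absent
    return grouping_subset.get(vtype, 'Other')

codes = ['TAXI', 'PASSENGER VEHICLE', 'E-SCOOTER', None]
grouped = [vtype_group_sketch(c) for c in codes]
print(grouped)
```

Keeping `None: None` in the table matters: without it, empty vehicle slots would be counted as `Other` vehicles.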

In [None]:
collisions_transformed_row = Row(*(["Time", "Street", "Borough", "Injured",
                                                "Killed", "Auto", "Bus",
                                                "Truck", "Taxi", "Other", ]))

def transform_involved(row):
    counts = Counter([row[i] for i in range(len(row)) if collisions_out_index[i].startswith("VEHICLE TYPE CODE")])
    return collisions_transformed_row(*([row.asDict()[c] for c in ["Time", "Street", "Borough",
                                                                      "NUMBER OF PERSONS INJURED",
                                                                      "NUMBER OF PERSONS KILLED"]] + 
                                       [counts[x] if x in counts else 0
                                           for x in ['Auto', 'Bus','Truck', 'Taxi', 'Other']]))

collisions_transformed = collisions_out_categories.rdd.map(transform_involved).toDF()
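
`transform_involved` turns the five vehicle type slots of a collision into one count per category. The core of that step is a `Counter` lookup, sketched here on a single made-up collision (the sample values are hypothetical):

```python
from collections import Counter

CATEGORIES = ['Auto', 'Bus', 'Truck', 'Taxi', 'Other']

# Hypothetical grouped vehicle slots 1 to 5 of one collision; None marks an unused slot
vehicle_slots = ['Auto', 'Taxi', 'Auto', None, None]

counts = Counter(vehicle_slots)
# A Counter returns 0 for missing keys, so every category gets a value
per_category = [counts[c] for c in CATEGORIES]
print(per_category)
```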

In [None]:
collisions_transformed_boolean_row = Row(*(["Time", "Street", "Borough",
                                                        "AccidentsWithInjuries",
                                                        "AccidentswithDeaths", "Auto", "Bus",
                                                        "Truck", "Taxi", "Other",
                                                        "Injured", "Killed"]))

collisions_transformed_boolean = collisions_transformed.rdd.map(
    lambda row: collisions_transformed_boolean_row(*([int(row.asDict()[c] > 0) if c in ["Injured",
                                                "Killed"] else row.asDict()[c]
                                                      for c in list(collisions_transformed_row)] + 
                                                    [row.Injured, row.Killed])))

In [None]:
collisions_transformed_boolean.take(1)

In [None]:
aggregation_columns = {x:"sum" for x in ["AccidentsWithInjuries", "AccidentswithDeaths",
                                    "Auto", "Bus", "Truck", "Taxi", "Other", "Injured", "Killed"]}
aggregation_columns.update({"*":"count"})

collisions_grouped = collisions_transformed_boolean.toDF().groupBy(
    "Time", "Street", "Borough").agg(aggregation_columns)

# rename columns names
for c in collisions_grouped.columns:
    if c.startswith("sum") or c.startswith("SUM"):
        collisions_grouped = collisions_grouped.withColumnRenamed(c, c[4:-1])
    elif c.startswith("count") or c.startswith("COUNT"):
        collisions_grouped = collisions_grouped.withColumnRenamed(c, "NumberOfAccidents")

In [None]:
collisions_grouped.take(1)
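
The grouping above produces one row per hour, street, and borough, with summed counts and the number of accidents in that group. The following plain-Python sketch mirrors that aggregation for a few columns on made-up rows; it is an illustration, not the Spark code, and the sample data is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows: (time, street, borough, injured, killed)
rows = [
    (datetime(2012, 7, 1, 9), 'atlanticav', 'brooklyn', 2, 0),
    (datetime(2012, 7, 1, 9), 'atlanticav', 'brooklyn', 0, 0),
    (datetime(2012, 7, 1, 9), 'brdway', 'manhattan', 1, 1),
]

def empty_group():
    return {'NumberOfAccidents': 0, 'AccidentsWithInjuries': 0,
            'Injured': 0, 'Killed': 0}

totals = defaultdict(empty_group)
for time, street, borough, injured, killed in rows:
    group = totals[(time, street, borough)]
    group['NumberOfAccidents'] += 1                     # count(*)
    group['AccidentsWithInjuries'] += int(injured > 0)  # sum of the 0/1 flag
    group['Injured'] += injured                         # sum of persons injured
    group['Killed'] += killed                           # sum of persons killed

key = (datetime(2012, 7, 1, 9), 'atlanticav', 'brooklyn')
print(totals[key])
```

The 0/1 flag and the raw number answer different questions: `AccidentsWithInjuries` counts accidents, while `Injured` counts people.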

In [None]:
collisions_final_row = Row(*(["Year", "Month", "Day", "Hour"] + collisions_grouped.columns[1:]))
collisions_final = collisions_grouped.rdd.map(lambda row: collisions_final_row(*([row.Time.year, row.Time.month,
                                                                              row.Time.day, row.Time.hour] +
                                                                             [row.asDict()[x]
                                                                              for x in collisions_final_row[4:]]))).toDF()

In [None]:
collisions_final.take(1)

### Determine the streets with the most collisions

Find the top ten streets in New York where the most vehicle collisions occurred. Display the results in a bar graph and as a scatter plot:

In [None]:
from pyspark.sql import functions as F

collisions_final_df = collisions_final
# Note the Spark DataFrame SQL-like methods available: groupBy, agg, sort (order by), limit
# The result is converted to a Pandas DataFrame
plottingdf = collisions_final_df.groupBy("Borough", "Street").agg(F.sum("NumberOfAccidents").alias("sum(NumberOfAccidents)")).\
sort(F.desc('sum(NumberOfAccidents)')).limit(10).toPandas()

plottingdf[['sum(NumberOfAccidents)']].plot(kind='barh', figsize=(11,7), legend=False)
plt.title('Top 10 Streets with the most accidents', size=20)
plt.xlabel('Count')
plt.yticks(range(10), plottingdf['Street'])
plt.gca().invert_yaxis()
plt.show()
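
The ranking groups by borough and street together, so streets that share a name in different boroughs are kept apart. A minimal sketch of the same group, sum, sort, and limit sequence on made-up rows (the sample data is hypothetical):

```python
from collections import Counter

# Hypothetical (borough, street, NumberOfAccidents) rows
rows = [
    ('brooklyn', 'atlanticav', 3),
    ('manhattan', 'brdway', 2),
    ('brooklyn', 'atlanticav', 4),
    ('queens', 'northernblvd', 5),
    ('brooklyn', 'brdway', 1),
]

# Sum the accidents per (borough, street) pair
street_totals = Counter()
for borough, street, n in rows:
    street_totals[(borough, street)] += n

# Sort descending by total and keep the first two entries
top_streets = street_totals.most_common(2)
print(top_streets)
```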

#### Now you can add the information about the top 10 streets into the scatter plot.

In [None]:
data1 = collisions_out_df[['Borough', 'Street', 'LATITUDE', 'LONGITUDE']].toPandas()

collisions1 = data1[np.logical_and(data1['Street']=='atlanticav', data1['Borough']=='brooklyn')]
collisions2 = data1[np.logical_and(data1['Street']=='northernblvd', data1['Borough']=='queens')]
collisions3 = data1[np.logical_and(data1['Street']=='brdway', data1['Borough']=='manhattan')]
collisions4 = data1[np.logical_and(data1['Street']=='flatbushav', data1['Borough']=='brooklyn')]
collisions5 = data1[np.logical_and(data1['Street']=='queensblvd', data1['Borough']=='queens')]
collisions6 = data1[np.logical_and(data1['Street']=='2av', data1['Borough']=='manhattan')]
collisions7 = data1[np.logical_and(data1['Street']=='hylanblvd', data1['Borough']=='staten island')]
collisions8 = data1[np.logical_and(data1['Street']=='nostrandav', data1['Borough']=='brooklyn')]
collisions9 = data1[np.logical_and(data1['Street']=='lindenblvd', data1['Borough']=='brooklyn')]
collisions10 = data1[np.logical_and(data1['Street']=='bedfordav', data1['Borough']=='brooklyn')]

#create scatterplots
plt.figure(figsize=(15,10))
plt.scatter(data1.LONGITUDE, data1.LATITUDE, s=1, color='darkseagreen')
plt.scatter(collisions1.LONGITUDE, collisions1.LATITUDE, s=2, color='red')
plt.scatter(collisions2.LONGITUDE, collisions2.LATITUDE, color='blue', s=2)
plt.scatter(collisions3.LONGITUDE, collisions3.LATITUDE, s=2, color='magenta')
plt.scatter(collisions4.LONGITUDE, collisions4.LATITUDE, color='orange', s=2)
plt.scatter(collisions5.LONGITUDE, collisions5.LATITUDE, s=2, color='yellow')
plt.scatter(collisions6.LONGITUDE, collisions6.LATITUDE, color='purple', s=2)
plt.scatter(collisions7.LONGITUDE, collisions7.LATITUDE, s=2, color='black')
plt.scatter(collisions8.LONGITUDE, collisions8.LATITUDE, color='chartreuse', s=2)
plt.scatter(collisions9.LONGITUDE, collisions9.LATITUDE, s=2, color='brown')
plt.scatter(collisions10.LONGITUDE, collisions10.LATITUDE, color='darkgreen', s=2)


#create legend
a_patch = mpatches.Patch(color='red', label='Atlantic Avenue')
b_patch = mpatches.Patch(color='blue', label='Northern Boulevard')
c_patch = mpatches.Patch(color='magenta', label='Broadway')
d_patch = mpatches.Patch(color='orange', label='Flatbush Avenue')
e_patch = mpatches.Patch(color='yellow', label='Queens Boulevard')
f_patch = mpatches.Patch(color='purple', label='2nd Avenue')
g_patch = mpatches.Patch(color='black', label='Hylan Boulevard')
h_patch = mpatches.Patch(color='chartreuse', label='Nostrand Avenue')
i_patch = mpatches.Patch(color='brown', label='Linden Boulevard')
j_patch = mpatches.Patch(color='darkgreen', label='Bedford Avenue')

plt.legend([a_patch, b_patch, c_patch, d_patch, e_patch, f_patch, g_patch, h_patch, i_patch, j_patch],
           ('Atlantic Avenue', 'Northern Boulevard', 'Broadway', 'Flatbush Avenue', 'Queens Boulevard', '2nd Avenue',
            'Hylan Boulevard', 'Nostrand Avenue', 'Linden Boulevard', 'Bedford Avenue'), 
           loc='upper left', prop={'size':20})

#adjust more settings
plt.title('Vehicle Collisions in New York City', size=25)
plt.xlim((-74.26,-73.7))
plt.ylim((40.5,40.92))
plt.xlabel('Longitude',size=20)
plt.ylabel('Latitude',size=20)
plt.show()

### Determining when the most collisions occurred
Now find out at what time of the day the most accidents occurred and see if you can detect any interesting patterns by running the following cell:

In [None]:
from pyspark.sql import functions as F

# Sort by hour so that the bar positions (the DataFrame index) line up with the hours of the day
hourplot = collisions_final_df[['Bus','Truck','Taxi','Other','Hour','Auto']].groupBy('Hour')\
.agg(F.sum("Bus").alias("Bus"), F.sum("Truck").alias("Truck"), F.sum("Taxi").alias("Taxi"),\
F.sum("Other").alias("Other"),F.sum("Auto").alias("Auto")).sort('Hour').toPandas()

# The 'Other' category is left out of the plot
hourplot[['Bus', 'Truck', 'Taxi', 'Auto']].plot(stacked=True, kind='bar',figsize=(12,8), alpha=1)
plt.xlabel('Hour', size=17)
plt.ylabel('Vehicles', size=17)
plt.legend(loc='best', prop={'size':20}, framealpha=0) 
plt.title('Collisions on Road per Hour', size=25)
plt.show()

This plot shows collisions spread across a day, with peaks during the morning and afternoon rush hours. You can see that significantly more collisions occurred during the afternoon rush hour than during the morning rush hour. Also, by far the most collisions involve cars, while buses, taxis, and trucks are involved in accidents far less frequently.

## Summary
This notebook showed you how to analyze motor vehicle accidents based on accident reports for New York and how you can use this information to learn more about the causes of collisions. If you extract this type of information from the data, you can use it to help develop measures for preventing vehicle accidents in accident hotspots.

### Author
The original notebook was created by Sven Hafeneger, a member of the Watson Studio development team at IBM Analytics in Germany. He holds an M.Sc. in Bioinformatics and is passionate about data analysis, machine learning, and the Python ecosystem for data science.

Copyright © IBM Corp. 2016, 2017. This notebook and its source code are released under the terms of the MIT License.