# Data Science Playbook

<p align="center">
  <img width="180" src="https://user-images.githubusercontent.com/19881320/54484151-b85c4780-4836-11e9-923f-c5e0e5afe866.jpg">
</p>

William Ponton

June 2019

## Table of Contents
- [Overview](#overview)

- [0.0 Importing Data](#import)

- [0.1 Cleaning and Organizing](#clean)

- [0.2 Numerical Analysis](#numerical)

- [0.3 Visualizations](#visualizations)

- [0.4 Interpretation and Reporting](#interpret)

## Overview

<a id="overview"></a>

This playbook uses ```pandas```, ```numpy```, ```matplotlib```, ```seaborn```, and ```bokeh```.  They are all imported up front, following the common naming conventions for these Python libraries.  ```pandas``` handles importing and wrangling, ```numpy``` handles numerical analysis, and ```matplotlib``` and ```seaborn``` handle visualizations.

### Import the common Data Analysis libraries

In [167]:
import pandas as pd
import numpy as np
import seaborn as sns
import bokeh as bk

In [168]:
# Render matplotlib charts inline in this notebook
%matplotlib inline
import matplotlib.pyplot as plt

In [169]:
import warnings
# Suppress warnings; added because of a warning raised by the KDE plot for 2D topo mappings
warnings.filterwarnings('ignore')

## 0.0 Importing Data

<a id="import"></a>

In [170]:
# Column names
column_names = ["Face", "Suit", "Value"]

In [171]:
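# Read the CSV file, passing the column_names list as the names parameter.
# Note: header=1 treats the second line of the file as the header row, so the
# first two lines of the file are not read as data. Use header=0 if the file
# has a single header line.
df = pd.read_csv("app_data/deck.csv", sep=",", header=1,
                 names=column_names, index_col=0)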

In [172]:
df.head(10)

Face,Suit,Value
queen,spades,12
jack,spades,11
ten,spades,10
nine,spades,9
eight,spades,8
seven,spades,7
six,spades,6
five,spades,5
four,spades,4
three,spades,3
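
The interaction between `header` and `names` in `pd.read_csv` is easy to get wrong, so here is a minimal sketch of it. The CSV text below is a hypothetical stand-in (the real contents of `app_data/deck.csv` are not shown in this playbook), and the names `csv_text`, `df_h0`, and `df_h1` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for a deck file: one header line, then data rows.
csv_text = "face,suit,value\nking,spades,13\nqueen,spades,12\njack,spades,11\n"
column_names = ["Face", "Suit", "Value"]

# header=0: the first line is the header; its labels are replaced by column_names.
df_h0 = pd.read_csv(io.StringIO(csv_text), header=0,
                    names=column_names, index_col=0)

# header=1: the second line is treated as the header, so the line before it
# is skipped and the "king" row never reaches the DataFrame.
df_h1 = pd.read_csv(io.StringIO(csv_text), header=1,
                    names=column_names, index_col=0)

print(list(df_h0.index))
print(list(df_h1.index))
```

If a deck should contain every card, comparing `len(df)` against the expected row count is a quick way to catch a dropped row.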


In [173]:
# Create a DataFrame from scratch
users = {"id":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], 
         "lastName": ["Osborne", "Kilmister", "Iommi", "Page", "Mercury", "Abbot", "Plant", "Holt", "Wylde", "Ponton", "Osborne", "May", "Lee", "Dickinson", "Maiden", "Zombie", "Ulrich", "Keenan", "Mustaine", "Grohl"], 
         "firstName": ["Ozzy", "Lemmy", "Tony", "Jimmy", "Freddie", "Dimebag Darrell", "Robert", "Gary", "Zakk", "William", "Sharon", "Brian", "Geddy", "Bruce", "Eddie", "Rob", "Lars", "Maynard James", "Dave", "Dave"], 
         "userName": ["theOzzman", "LemmyIsGod", "ironMan", "zoso", "theQueen", "dimeslime", "theBard", "officerHolt", "zakkwyldebls", "handcraftedstatic", "queenBee", "bMay", "geddyLee", "captBruce", "Eddie", "zombieRob", "lars", "maynard", "mustaine", "daveG"],
         "birthYear" : [1948, 1945, 1948, 1944, 1946, 1966, 1948, 1965, 1967, 1982, 1952, 1947, 1953, 1958, 1979, 1965, 1963, 1964, 1961, 1969], 
         "points":[27, 36, 29, 9, 85, 80, 17, 75, 47, 25, 14, 50, 14, 72, 75, None, 73, 31, 84, 92], 
         "email" : ["ozman@gmail.com", "aceofspades@hotmail.com", "iamironman@gmail.com", "zoso@yahoo.com", None, "dimeslime@gmail.com", "sirRobertPlant@yahoo.com", "officerHolt@gmail.com", "tBLSt@yahoo.com", "spacejazzmusic@gmail.com", "sharonoz@gmail.com", "brianmay@hotmail.com", "geddyLee@yahoo.com", "captainBruce@gmail.com", None, "threefromhell@yahoo.com", "total_drummer@gmail.com", "mjk@gmail.com", "mechanix@gmail.com", "foofighter@gmail.com"]}

In [174]:
rockers_df = pd.DataFrame(data=users)

In [175]:
rockers_df.head(20)

,id,lastName,firstName,userName,birthYear,points,email
0,1,Osborne,Ozzy,theOzzman,1948,27.0,ozman@gmail.com
1,2,Kilmister,Lemmy,LemmyIsGod,1945,36.0,aceofspades@hotmail.com
2,3,Iommi,Tony,ironMan,1948,29.0,iamironman@gmail.com
3,4,Page,Jimmy,zoso,1944,9.0,zoso@yahoo.com
4,5,Mercury,Freddie,theQueen,1946,85.0,
5,6,Abbot,Dimebag Darrell,dimeslime,1966,80.0,dimeslime@gmail.com
6,7,Plant,Robert,theBard,1948,17.0,sirRobertPlant@yahoo.com
7,8,Holt,Gary,officerHolt,1965,75.0,officerHolt@gmail.com
8,9,Wylde,Zakk,zakkwyldebls,1967,47.0,tBLSt@yahoo.com
9,10,Ponton,William,handcraftedstatic,1982,25.0,spacejazzmusic@gmail.com


In [176]:
# Show the Name and Birth years only
rockers_df[["lastName", "firstName", "birthYear"]]

,lastName,firstName,birthYear
0,Osborne,Ozzy,1948
1,Kilmister,Lemmy,1945
2,Iommi,Tony,1948
3,Page,Jimmy,1944
4,Mercury,Freddie,1946
5,Abbot,Dimebag Darrell,1966
6,Plant,Robert,1948
7,Holt,Gary,1965
8,Wylde,Zakk,1967
9,Ponton,William,1982


In [177]:
# Show a slice of the user list (positions 3 through 6; the slice end is exclusive)
rockers_df[3:7]

,id,lastName,firstName,userName,birthYear,points,email
3,4,Page,Jimmy,zoso,1944,9.0,zoso@yahoo.com
4,5,Mercury,Freddie,theQueen,1946,85.0,
5,6,Abbot,Dimebag Darrell,dimeslime,1966,80.0,dimeslime@gmail.com
6,7,Plant,Robert,theBard,1948,17.0,sirRobertPlant@yahoo.com
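
Row slicing has three closely related spellings that differ in whether the end point is included. The sketch below uses a small made-up frame (`sample` is not part of the playbook's data) to show the difference.

```python
import pandas as pd

# Illustrative frame with a default 0..9 integer index.
sample = pd.DataFrame({"id": range(1, 11)})

by_slice = sample[3:7]       # positional, end exclusive: positions 3..6
by_iloc = sample.iloc[3:7]   # same rows, with explicit positional indexing
by_loc = sample.loc[3:7]     # label-based, end inclusive: labels 3..7

print(list(by_slice.index))
print(list(by_iloc.index))
print(list(by_loc.index))
```

Using `.iloc` or `.loc` explicitly makes the intent clear to a reader, especially once the index is no longer the default integer range.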


In [178]:
# Find all null values in the points column
rockers_df[rockers_df["points"].isnull()]

,id,lastName,firstName,userName,birthYear,points,email
15,16,Zombie,Rob,zombieRob,1965,,threefromhell@yahoo.com


In [179]:
# Find all records with a null value in the email column
rockers_df[rockers_df["email"].isnull()]

,id,lastName,firstName,userName,birthYear,points,email
4,5,Mercury,Freddie,theQueen,1946,85.0,
14,15,Maiden,Eddie,Eddie,1979,75.0,


In [190]:
# Fill null emails with a placeholder; fillna returns a new Series and leaves rockers_df unchanged
rockers_df["email"].fillna("None@gmail.com")

0              ozman@gmail.com
1      aceofspades@hotmail.com
2         iamironman@gmail.com
3               zoso@yahoo.com
4               None@gmail.com
5          dimeslime@gmail.com
6     sirRobertPlant@yahoo.com
7        officerHolt@gmail.com
8              tBLSt@yahoo.com
9     spacejazzmusic@gmail.com
10          sharonoz@gmail.com
11        brianmay@hotmail.com
12          geddyLee@yahoo.com
13      captainBruce@gmail.com
14              None@gmail.com
15     threefromhell@yahoo.com
16     total_drummer@gmail.com
17               mjk@gmail.com
18          mechanix@gmail.com
19        foofighter@gmail.com
Name: email, dtype: object
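
Because `fillna` returns a new object, the filled values are lost unless the result is assigned back. A minimal sketch, using a small made-up frame (`sample` and the `example.com` addresses are illustrative, not the rockers data):

```python
import pandas as pd

sample = pd.DataFrame({"email": ["a@example.com", None, "c@example.com"]})

# Calling fillna without assigning the result leaves the frame untouched.
filled = sample["email"].fillna("None@gmail.com")
nulls_before_assign = int(sample["email"].isnull().sum())

# Assigning the result back to the column makes the change stick.
sample["email"] = sample["email"].fillna("None@gmail.com")
nulls_after_assign = int(sample["email"].isnull().sum())

print(nulls_before_assign, nulls_after_assign)
```

This is also why a later `describe()` on the original column can still report fewer non-null values than there are rows.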

In [191]:
# Drop duplicate values in the birthYear column
rockers_df[["birthYear"]].drop_duplicates()

,birthYear
0,1948
1,1945
3,1944
4,1946
5,1966
7,1965
8,1967
9,1982
10,1952
11,1947


In [192]:
# Return a NumPy array of the unique birthYear values
rockers_df["birthYear"].unique()

array([1948, 1945, 1944, 1946, 1966, 1965, 1967, 1982, 1952, 1947, 1953,
       1958, 1979, 1963, 1964, 1961, 1969])
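
`drop_duplicates` and `unique` answer the same question but return different things, and `value_counts` adds how often each value occurs. A short sketch on a made-up sample (`years` is illustrative, not the full column):

```python
import pandas as pd

years = pd.Series([1948, 1945, 1948, 1944, 1948], name="birthYear")

unique_values = years.unique()      # NumPy array, in order of first appearance
deduped = years.drop_duplicates()   # Series that keeps the original index labels
counts = years.value_counts()       # occurrences per value, most frequent first
n_unique = years.nunique()          # number of distinct values

print(unique_values)
print(list(deduped.index))
print(counts.loc[1948], n_unique)
```

The retained index labels from `drop_duplicates` are useful when the first occurrence needs to be traced back to its row.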

In [193]:
# Display summary statistics for the email column
rockers_df["email"].describe()

count                     18
unique                    18
top       sharonoz@gmail.com
freq                       1
Name: email, dtype: object

In [194]:
# Show records of only rockers born in the 1960s
rockers_df[(rockers_df["birthYear"] >= 1960) & (rockers_df["birthYear"] <= 1969)]

,id,lastName,firstName,userName,birthYear,points,email
5,6,Abbot,Dimebag Darrell,dimeslime,1966,80.0,dimeslime@gmail.com
7,8,Holt,Gary,officerHolt,1965,75.0,officerHolt@gmail.com
8,9,Wylde,Zakk,zakkwyldebls,1967,47.0,tBLSt@yahoo.com
15,16,Zombie,Rob,zombieRob,1965,,threefromhell@yahoo.com
16,17,Ulrich,Lars,lars,1963,73.0,total_drummer@gmail.com
17,18,Keenan,Maynard James,maynard,1964,31.0,mjk@gmail.com
18,19,Mustaine,Dave,mustaine,1961,84.0,mechanix@gmail.com
19,20,Grohl,Dave,daveG,1969,92.0,foofighter@gmail.com
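
A range filter like the one above can also be written with `Series.between`, which includes both end points by default. This is a sketch on a small made-up frame (`sample` holds only a few of the rows), not a replacement for the author's approach.

```python
import pandas as pd

sample = pd.DataFrame({
    "lastName": ["Page", "Holt", "Ponton", "Grohl"],
    "birthYear": [1944, 1965, 1982, 1969],
})

# between(1960, 1969) keeps 1960 <= birthYear <= 1969.
sixties = sample[sample["birthYear"].between(1960, 1969)]

print(sixties)
```

Spelling the decade as 1960 to 1969 avoids accidentally including someone born in 1970.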


In [195]:
# Show all records where the firstName starts with "Da" (matches Dave)
rockers_df[rockers_df["firstName"].str.startswith('Da')]

,id,lastName,firstName,userName,birthYear,points,email
18,19,Mustaine,Dave,mustaine,1961,84.0,mechanix@gmail.com
19,20,Grohl,Dave,daveG,1969,92.0,foofighter@gmail.com


In [196]:
# Show all records where the lastName ends with "borne" for Osborne
rockers_df[rockers_df["lastName"].str.endswith('borne')]

,id,lastName,firstName,userName,birthYear,points,email
0,1,Osborne,Ozzy,theOzzman,1948,27.0,ozman@gmail.com
10,11,Osborne,Sharon,queenBee,1952,14.0,sharonoz@gmail.com


In [198]:
# Show all records where the userName contains a lowercase "z"
rockers_df[rockers_df["userName"].str.contains('z')]

,id,lastName,firstName,userName,birthYear,points,email
0,1,Osborne,Ozzy,theOzzman,1948,27.0,ozman@gmail.com
3,4,Page,Jimmy,zoso,1944,9.0,zoso@yahoo.com
8,9,Wylde,Zakk,zakkwyldebls,1967,47.0,tBLSt@yahoo.com
15,16,Zombie,Rob,zombieRob,1965,,threefromhell@yahoo.com
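
Filtering on an email domain such as `@hotmail.com` needs one extra step, because the email column contains nulls and the string methods return a missing value for those rows. The sketch below assumes a small made-up subset of the data (`sample`) and uses the `na=False` argument so that null emails count as non-matches.

```python
import pandas as pd

sample = pd.DataFrame({
    "lastName": ["Kilmister", "Mercury", "May", "Grohl"],
    "email": ["aceofspades@hotmail.com", None,
              "brianmay@hotmail.com", "foofighter@gmail.com"],
})

# na=False turns the missing result for null emails into False,
# which keeps the mask purely boolean and safe to index with.
is_hotmail = sample["email"].str.endswith("@hotmail.com", na=False)
hotmail_users = sample[is_hotmail]

print(hotmail_users)
```

Without `na=False`, the mask would contain a missing value and pandas would typically refuse to use it for boolean indexing.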


In [204]:
# Show first name, last name, and birth year for all records where the first name contains "J"
rockers_df[rockers_df["firstName"].str.contains('J')][["firstName", "lastName", "birthYear"]]

,firstName,lastName,birthYear
3,Jimmy,Page,1944
17,Maynard James,Keenan,1964


In [199]:
# Find the mean of the values in the points column
rockers_df["points"].mean()

49.21052631578947

In [200]:
# Find the mean of the values in the birthYear column
rockers_df["birthYear"].mean()

1958.5
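
The points mean above is computed over 19 values rather than 20, because `mean` skips nulls by default. A minimal sketch of that behavior on a made-up sample (`points` here is illustrative, not the full column):

```python
import math
import pandas as pd

points = pd.Series([27, 36, None], name="points")

mean_skipping_nulls = points.mean()           # (27 + 36) / 2
mean_with_nulls = points.mean(skipna=False)   # NaN, because a null is present
non_null_count = int(points.count())          # number of values actually averaged

print(mean_skipping_nulls, mean_with_nulls, non_null_count)
```

Checking `count()` next to `len()` shows how many rows actually contributed to the average.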

In [None]:
# Read the UFO sightings CSV file into a DataFrame
ufo_df = pd.read_csv("app_data/ufo_sightings_complete.csv", sep=",", header=0, index_col=0)

## 0.2 Numerical Analysis
<a id="numerical"></a>

## 0.3 Visualizations
<a id="visualizations"></a>

## 0.4 Interpretation and Reporting

<a id="interpret"></a>