# Data Wrangling

Data wrangling means handling missing or ill-formatted data and converting it into a format that lends itself more easily to analysis.

### Most common Data Formats:

1) CSV

2) XML

3) JSON
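
As a rough illustration of how the three formats differ, the sketch below parses the same record from CSV, XML, and JSON strings using only the Python standard library. The field names echo the baseball data used below, but the sample values are invented for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample record, written in each of the three formats.
csv_text = "nameFirst,nameLast,height\nJane,Doe,70\n"
xml_text = (
    "<player><nameFirst>Jane</nameFirst>"
    "<nameLast>Doe</nameLast><height>70</height></player>"
)
json_text = '{"nameFirst": "Jane", "nameLast": "Doe", "height": 70}'

# CSV: one row per record; every value arrives as a string.
csv_record = next(csv.DictReader(io.StringIO(csv_text)))

# XML: nested elements; element text is also a string.
root = ET.fromstring(xml_text)
xml_record = {child.tag: child.text for child in root}

# JSON: key/value pairs; numbers keep their numeric type.
json_record = json.loads(json_text)

print(type(csv_record["height"]).__name__)
print(type(xml_record["height"]).__name__)
print(type(json_record["height"]).__name__)
```

The practical difference is that CSV and XML values need explicit type conversion after parsing, while JSON already distinguishes numbers from strings.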

## Baseball Database

http://seanlahman.com/baseball-archive/statistics

In [57]:
import numpy as np
import pandas as pd
import os

In [73]:
input_dir = os.getcwd() + '/data/input/'
output_dir = os.getcwd() + '/data/output/'

In [74]:
baseball_data = pd.read_csv(input_dir + 'baseballdatabank-2017.1/core/Master.csv')

In [75]:
print(baseball_data['nameFirst'])

0           David
1            Hank
2          Tommie
3             Don
4            Andy
5        Fernando
6            John
7              Ed
8            Bert
9         Charlie
10            Dan
11           Fred
12          Glenn
13           Jeff
14            Jim
15           Kurt
16           Kyle
17            Ody
18           Paul
19             Al
20          Frank
21         Reggie
22           Bill
23          Brent
24            Ted
25            Ted
26          Woody
27          Cliff
28          Harry
29          Shawn
           ...   
19075      Jordan
19076         Roy
19077        Ryan
19078     Charlie
19079      Walter
19080       Frank
19081         Guy
19082       Jimmy
19083        Bill
19084        Alan
19085         Bud
19086      Richie
19087       Barry
19088       Billy
19089          Ed
19090         Ben
19091       Peter
19092         Sam
19093       Eddie
19094        Bill
19095         Jon
19096       Julio
19097        Joel
19098        Mike
19099     

### Add a new column

In [76]:
baseball_data['height_plus_weight'] = baseball_data['height'] + baseball_data['weight']

In [77]:
print(baseball_data['height_plus_weight'])

0        290.0
1        252.0
2        265.0
3        265.0
4        257.0
5        293.0
6        264.0
7        241.0
8        246.0
9        237.0
10       261.0
11       250.0
12       278.0
13       264.0
14       275.0
15       251.0
16       276.0
17       254.0
18       260.0
19       269.0
20         NaN
21       290.0
22       260.0
23       258.0
24       284.0
25       291.0
26       242.0
27       272.0
28       274.0
29       263.0
         ...  
19075    299.0
19076    261.0
19077    300.0
19078    263.0
19079    237.0
19080    218.0
19081    240.0
19082    267.0
19083    258.0
19084    274.0
19085    275.0
19086    273.0
19087    279.0
19088    245.0
19089    252.0
19090    285.0
19091    274.0
19092    256.0
19093    247.0
19094    269.0
19095    263.0
19096    308.0
19097    290.0
19098    294.0
19099    296.0
19100    253.0
19101    245.0
19102    271.0
19103    226.0
19104    265.0
Name: height_plus_weight, Length: 19105, dtype: float64
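
Row 20 in the output above is `NaN`, most likely because that player's height or weight is missing: pandas propagates a missing operand through element-wise addition. A minimal sketch with invented numbers (not taken from the baseball data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the second row has no recorded height.
players = pd.DataFrame({
    "height": [75.0, np.nan, 71.0],
    "weight": [215.0, 180.0, 190.0],
})

# Element-wise addition: a missing operand makes the result missing too.
players["height_plus_weight"] = players["height"] + players["weight"]

# Count how many rows ended up without a sum.
missing_rows = int(players["height_plus_weight"].isna().sum())
print(players)
print("rows with a missing sum:", missing_rows)
```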


### Export the new DataFrame to CSV (with weight-height sum column)

For more info: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

In [78]:
baseball_data.to_csv(output_dir + 'baseball_data_with_weight_height_sum.csv')
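
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. If that column is not wanted, passing `index=False` leaves it out. A small sketch, with an in-memory buffer standing in for the output file and made-up data:

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for the baseball data.
df = pd.DataFrame({"nameFirst": ["Jane", "John"], "height": [70, 72]})

# Default behaviour: the index is written as a leading, unnamed column.
with_index = io.StringIO()
df.to_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
print(header_with_index)
print(header_without_index)
```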

# Data Wrangling - Lesson 3 Quiz 

In [79]:
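def add_full_name(path_to_csv, path_to_new_csv):
    # Read the CSV at path_to_csv, add a 'nameFull' column
    # ('nameFirst' + a space + 'nameLast'), and write the result to path_to_new_csv.
    baseball_data = pd.read_csv(path_to_csv)
    baseball_data['nameFull'] = baseball_data['nameFirst'] + " " + baseball_data['nameLast']
    
    baseball_data.to_csv(path_to_new_csv)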

In [80]:
path_to_csv = input_dir + 'baseballdatabank-2017.1/core/Master.csv'
path_to_new_csv = output_dir + 'baseball_data_with_fullname.csv'

add_full_name(path_to_csv, path_to_new_csv)
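
One thing to watch for with this kind of concatenation: if either name part is missing, the plain `+` yields a missing full name. The sketch below shows that behaviour on two invented rows, along with one possible fallback (`nameFullFilled` is a hypothetical column name, not part of the quiz).

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second player has no recorded first name.
names = pd.DataFrame({
    "nameFirst": ["Jane", np.nan],
    "nameLast": ["Doe", "Smith"],
})

# Plain concatenation: a missing part makes the whole result missing.
names["nameFull"] = names["nameFirst"] + " " + names["nameLast"]

# One possible fallback: treat missing parts as empty strings, then trim.
names["nameFullFilled"] = (
    names["nameFirst"].fillna("") + " " + names["nameLast"].fillna("")
).str.strip()

print(names)
```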

# Working with Relational Databases using Pandasql

1) It is straightforward to extract aggregated data with complex filters

2) A database scales well

3) It ensures all data is consistently formatted (each column has one format)
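
To make the first point concrete, here is a minimal sketch of an aggregate query with a filter, run against an in-memory SQLite database through Python's built-in `sqlite3` module. The table name `enrolments`, its columns, and the rows are all invented for illustration and are not the Aadhaar schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrolments (district TEXT, age INTEGER, generated INTEGER)"
)
conn.executemany(
    "INSERT INTO enrolments VALUES (?, ?, ?)",
    [
        ("North", 25, 3),
        ("North", 61, 2),
        ("South", 70, 5),
        ("South", 18, 4),
        ("South", 66, 1),
    ],
)

# Aggregate with a filter: total "generated" per district for ages over 50.
query = """
    SELECT district, SUM(generated)
    FROM enrolments
    WHERE age > 50
    GROUP BY district
    ORDER BY district;
"""
totals = conn.execute(query).fetchall()
conn.close()
print(totals)
```

The filtering, grouping, and summing are all expressed in one declarative statement, which is the convenience the first point refers to.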

## Aadhaar Data - Lesson 3 Quiz

https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/aadhaar_data.csv

In [86]:
import pandasql as pdsql

In [92]:

def select_first_50(filename):
    #Read aadhaar_data csv to a pandas dataframe.  
    aadhaar_data = pd.read_csv(input_dir + filename)
    
    #Rename the columns by replacing spaces with underscores and setting all characters to lowercase
    aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)

    # Select out the first 50 values for "registrar" and "enrolment_agency"
    # in the aadhaar_data table using SQL syntax. 
    q = """
        SELECT registrar, enrolment_agency FROM aadhaar_data LIMIT 50;
    """

    #Execute your SQL command against the pandas frame
    aadhaar_solution = pdsql.sqldf(q.lower(), locals())
    return aadhaar_solution    

In [94]:
select_first_50('aadhaar-data.csv')

        registrar             enrolment_agency
0  Allahabad Bank            Tera Software Ltd
1  Allahabad Bank            Tera Software Ltd
2  Allahabad Bank  Vakrangee Softwares Limited
3  Allahabad Bank  Vakrangee Softwares Limited
4  Allahabad Bank  Vakrangee Softwares Limited
5  Allahabad Bank  Vakrangee Softwares Limited
6  Allahabad Bank  Vakrangee Softwares Limited
7  Allahabad Bank  Vakrangee Softwares Limited
8  Allahabad Bank  Vakrangee Softwares Limited
9  Allahabad Bank  Vakrangee Softwares Limited


In [None]:
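def select_by_state(filename, state):
    # Read the aadhaar_data csv into a pandas dataframe.
    aadhaar_data = pd.read_csv(input_dir + filename)
    
    # Rename the columns by replacing spaces with underscores and setting all characters to lowercase
    aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)

    # Select every row whose "state" column equals the state passed in.
    # The value is inserted as a quoted SQL literal (single quotes doubled);
    # "state = state" would compare the column with itself and match every row.
    q = """
        SELECT * FROM aadhaar_data WHERE state = '{}';
    """.format(state.replace("'", "''"))

    # Execute the SQL command against the pandas frame.
    # The query is not lowercased, because that would also lowercase the state value.
    aadhaar_solution = pdsql.sqldf(q, locals())
    return aadhaar_solution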
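
A state filter needs the state value to reach the SQL engine as a literal or a bound parameter; a condition written as `state = state` compares the column with itself and matches every row. One alternative to building the query string by hand is to load the frame into SQLite and let the driver bind the value. The sketch below does that with pandas and the built-in `sqlite3` module instead of pandasql; the helper name `select_state` and the toy rows are made up for illustration.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the Aadhaar frame after column renaming.
aadhaar_data = pd.DataFrame({
    "state": ["Bihar", "Kerala", "Bihar"],
    "registrar": ["Allahabad Bank", "Allahabad Bank", "Allahabad Bank"],
})

def select_state(frame, state):
    """Return the rows of `frame` whose state column equals `state`."""
    conn = sqlite3.connect(":memory:")
    try:
        # Copy the frame into a temporary SQLite table.
        frame.to_sql("aadhaar_data", conn, index=False)
        # The "?" placeholder is bound by the driver, so no manual quoting.
        return pd.read_sql_query(
            "SELECT * FROM aadhaar_data WHERE state = ?;",
            conn,
            params=(state,),
        )
    finally:
        conn.close()

bihar_rows = select_state(aadhaar_data, "Bihar")
print(bihar_rows)
```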