# CSV Exercise

In [10]:
import pandas
import pandasql

Assume you will be reading in a CSV file with the same columns as the Lahman baseball data set. Most importantly, there are columns called `nameFirst` and `nameLast`.

1. Write a function that reads a CSV located at `path_to_csv` into a pandas dataframe and adds a new column called `nameFull` with a player's full name. For example, for Hank Aaron, `nameFull` would be `Hank Aaron`.
2. Write the data in the pandas dataframe to a new CSV file located at `path_to_new_csv`.



In [6]:
def add_full_name(path_to_csv, path_to_new_csv):
    dataframe = pandas.read_csv(path_to_csv)
    dataframe['nameFull'] = dataframe['nameFirst'] + ' ' + dataframe['nameLast']
    dataframe.to_csv(path_to_new_csv)

add_full_name('Master.csv', 'new_master.csv')
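
As a quick check of the same steps without touching the file system, here is a minimal sketch that runs them on an in-memory CSV. The two-row sample and the variable names `sample_csv`, `output`, and `result` are assumptions made for illustration. It also passes `index=False`, which is not part of the solution above: by default `to_csv` writes the row index as an extra unnamed first column.

```python
import io

import pandas

# Hypothetical two-row sample standing in for Master.csv
sample_csv = io.StringIO("nameFirst,nameLast\nHank,Aaron\nTommie,Aaron\n")

dataframe = pandas.read_csv(sample_csv)
# String columns concatenate element-wise, row by row
dataframe['nameFull'] = dataframe['nameFirst'] + ' ' + dataframe['nameLast']

# index=False leaves out the row index that to_csv writes by default
output = io.StringIO()
dataframe.to_csv(output, index=False)
result = output.getvalue()
print(result)
```

If either name is missing in a row, the concatenation yields a missing value for `nameFull` in that row, so real data may need a cleanup step first.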

# Write Your Own Simple Query

Read in our aadhaar_data CSV to a pandas dataframe. Afterwards, we rename the columns by replacing spaces with underscores and setting all characters to lowercase, so that the column names more closely resemble those one might find in a database table.
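
The renaming rule can be tried on its own, outside pandas. This is a minimal sketch: the helper name `normalize` and the sample headers in `original` are assumptions chosen to resemble the raw file, not values read from it.

```python
def normalize(column_name):
    # Same rule as the rename lambda: spaces become underscores, then lowercase
    return column_name.replace(' ', '_').lower()

# Hypothetical raw headers, for illustration only
original = ['Registrar', 'Enrolment Agency', 'Aadhaar generated']
renamed = [normalize(name) for name in original]
print(renamed)  # ['registrar', 'enrolment_agency', 'aadhaar_generated']
```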

Select the first 50 values for `registrar` and `enrolment_agency` in the aadhaar_data table using SQL syntax.
Note that `enrolment_agency` is spelled with one "l". The order of the selected columns also matters: select `registrar` first, then `enrolment_agency`.
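
pandasql uses SQLite syntax, so the shape of this query can be explored with Python's built-in `sqlite3` module. The sketch below is only an illustration: the table is filled with 60 made-up rows, and the `state` column is included just to show that unselected columns are dropped.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE aadhaar_data (registrar TEXT, enrolment_agency TEXT, state TEXT)')

# 60 made-up rows, so that LIMIT 50 actually cuts something off
rows = [('Registrar %d' % i, 'Agency %d' % i, 'State') for i in range(60)]
conn.executemany('INSERT INTO aadhaar_data VALUES (?, ?, ?)', rows)

cursor = conn.execute('SELECT registrar, enrolment_agency FROM aadhaar_data LIMIT 50;')
# Result columns come back in the order they are listed in the SELECT
column_names = [description[0] for description in cursor.description]
result = cursor.fetchall()
conn.close()

print(column_names, len(result))
```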

In [11]:
def select_first_50(filename):
    aadhaar_data = pandas.read_csv(filename)
    aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)
    q = """
    SELECT registrar, enrolment_agency FROM aadhaar_data LIMIT 50;
    """
    aadhaar_solution = pandasql.sqldf(q.lower(), locals())
    return aadhaar_solution

select_first_50('aadhaar_data.csv')

| | registrar | enrolment_agency |
|---|---|---|
| 0 | Allahabad Bank | Tera Software Ltd |
| 1 | Allahabad Bank | Tera Software Ltd |
| 2 | Allahabad Bank | Vakrangee Softwares Limited |
| 3 | Allahabad Bank | Vakrangee Softwares Limited |
| 4 | Allahabad Bank | Vakrangee Softwares Limited |
| 5 | Allahabad Bank | Vakrangee Softwares Limited |
| 6 | Allahabad Bank | Vakrangee Softwares Limited |
| 7 | Allahabad Bank | Vakrangee Softwares Limited |
| 8 | Allahabad Bank | Vakrangee Softwares Limited |
| 9 | Allahabad Bank | Vakrangee Softwares Limited |


# Write Your Own Complex Query

Write a query that will select from the aadhaar_data table how many men and how many women over the age of 50 have had aadhaar generated for them in each district.

`aadhaar_generated` is a column in the Aadhaar Data that denotes the number who have had aadhaar generated in each row of the table.

Note that in this quiz, the SQL query keywords are case sensitive. For example, if you want to do a sum make sure you type `sum` rather than `SUM`.

The possible columns to select from aadhaar data are:

1. registrar
2. enrolment_agency
3. state
4. district
5. sub_district
6. pin_code
7. gender
8. age
9. aadhaar_generated
10. enrolment_rejected
11. residents_providing_email
12. residents_providing_mobile_number
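
To see how filtering and grouping combine before running against the full file, here is a small sketch using Python's built-in `sqlite3` module (pandasql uses SQLite syntax). The six rows are made up for illustration, and the `order by` clause is added only to make the printed order predictable.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE aadhaar_data '
    '(district TEXT, gender TEXT, age INTEGER, aadhaar_generated INTEGER)'
)

# Made-up rows for illustration
rows = [
    ('Ajmer', 'F', 62, 3),
    ('Ajmer', 'F', 55, 4),
    ('Ajmer', 'M', 70, 5),
    ('Ajmer', 'F', 30, 9),  # dropped by the filter: age is not over 50
    ('Akola', 'M', 51, 2),
    ('Akola', 'M', 50, 8),  # dropped by the filter: 50 is not over 50
]
conn.executemany('INSERT INTO aadhaar_data VALUES (?, ?, ?, ?)', rows)

query = """
select gender, district, sum(aadhaar_generated)
from aadhaar_data
where age > 50
group by gender, district
order by gender, district
"""
totals = conn.execute(query).fetchall()
conn.close()

print(totals)  # [('F', 'Ajmer', 7), ('M', 'Ajmer', 5), ('M', 'Akola', 2)]
```

The `where` clause removes rows before grouping, so each sum only counts rows with `age > 50`.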

In [12]:
def aggregate_query(filename):
    aadhaar_data = pandas.read_csv(filename)
    aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)
    q = """
    SELECT gender, district, sum(aadhaar_generated) FROM aadhaar_data 
    WHERE age > 50 GROUP BY gender, district
    """
    aadhaar_solution = pandasql.sqldf(q.lower(), locals())
    return aadhaar_solution 

aggregate_query('aadhaar_data.csv')

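| | gender | district | sum(aadhaar_generated) |
|---|---|---|---|
| 0 | F | Ahmadnagar | 45 |
| 1 | F | Ahmed Nagar | 0 |
| 2 | F | Ahmedabad | 1 |
| 3 | F | Ajmer | 27 |
| 4 | F | Akola | 5 |
| 5 | F | Alirajpur | 71 |
| 6 | F | Allahabad | 15 |
| 7 | F | Alwar | 14 |
| 8 | F | Ambala | 7 |
| 9 | F | Amravati | 0 |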


# API Exercise

In [17]:
import json
import requests

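# Get your own API key
API_KEY = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

# Top artists in Spain, returned as JSON
URL = 'http://ws.audioscrobbler.com/2.0/?method=geo.gettopartists&country=spain&format=json&api_key=' + API_KEY

def api_get_request(url):
    # Request the url that was passed in (instead of overwriting it) and parse the JSON body
    data = requests.get(url).text
    data = json.loads(data)
    # Return the name of the first artist in the list
    return data['topartists']['artist'][0]['name']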
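
The parsing step can be tried without a network call or an API key. The sketch below assumes only the nesting implied by the lookup above (`topartists`, then `artist`, then a list of objects with a `name`). The sample payload and the helper name `top_artist_name` are made up for illustration, and returning `None` for an empty response is an added design choice, not part of the exercise.

```python
import json

# Made-up payload that mirrors only the nesting used by the lookup
sample_response = '{"topartists": {"artist": [{"name": "Artist A"}, {"name": "Artist B"}]}}'

def top_artist_name(response_text):
    data = json.loads(response_text)
    # .get() with defaults avoids a KeyError when a level is missing
    artists = data.get('topartists', {}).get('artist', [])
    return artists[0]['name'] if artists else None

print(top_artist_name(sample_response))  # Artist A
print(top_artist_name('{}'))             # None
```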

# Imputation Exercise

Pandas dataframes have a method called `fillna(value)`, which replaces any NAs in a dataframe or series with the single value you pass in. You can call it like this:
```
dataframe['column'] = dataframe['column'].fillna(value)
```

Using the `numpy.mean` function, which calculates the mean of a numpy array, impute any missing values in the `weight` column of our Lahman baseball data set by setting them equal to the average weight.

You can access the `weight` column in the baseball dataframe by calling `baseball['weight']`.
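
Before applying this to the full data set, the idea can be checked on a tiny dataframe. This is a minimal sketch with made-up weights. It relies on `numpy.mean`, when given a pandas series, deferring to the series' own `mean`, which skips missing values by default; on a plain numpy array containing NaN the result would be NaN instead.

```python
import numpy
import pandas

# Made-up weights with two missing values
baseball = pandas.DataFrame({'weight': [180.0, numpy.nan, 200.0, numpy.nan]})

# Mean of the known weights only: (180 + 200) / 2
avg_weight = numpy.mean(baseball['weight'])
baseball['weight'] = baseball['weight'].fillna(avg_weight)

filled = baseball['weight'].tolist()
print(filled)  # [180.0, 190.0, 200.0, 190.0]
```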

In [19]:
import numpy

def imputation(filename):
    baseball = pandas.read_csv(filename)
    avg_weight = numpy.mean(baseball['weight'])
    baseball['weight'] = baseball['weight'].fillna(avg_weight)
    return baseball

imputation('Master.csv')

| | playerID | birthYear | birthMonth | birthDay | birthCountry | birthState | birthCity | deathYear | deathMonth | deathDay | ... | nameLast | nameGiven | weight | height | bats | throws | debut | finalGame | retroID | bbrefID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardsda01 | 1981.0 | 12.0 | 27.0 | USA | CO | Denver | | | | ... | Aardsma | David Allan | 215.000000 | 75.0 | R | R | 2004-04-06 | 2015-08-23 | aardd001 | aardsda01 |
| 1 | aaronha01 | 1934.0 | 2.0 | 5.0 | USA | AL | Mobile | | | | ... | Aaron | Henry Louis | 180.000000 | 72.0 | R | R | 1954-04-13 | 1976-10-03 | aaroh101 | aaronha01 |
| 2 | aaronto01 | 1939.0 | 8.0 | 5.0 | USA | AL | Mobile | 1984.0 | 8.0 | 16.0 | ... | Aaron | Tommie Lee | 190.000000 | 75.0 | R | R | 1962-04-10 | 1971-09-26 | aarot101 | aaronto01 |
| 3 | aasedo01 | 1954.0 | 9.0 | 8.0 | USA | CA | Orange | | | | ... | Aase | Donald William | 190.000000 | 75.0 | R | R | 1977-07-26 | 1990-10-03 | aased001 | aasedo01 |
| 4 | abadan01 | 1972.0 | 8.0 | 25.0 | USA | FL | Palm Beach | | | | ... | Abad | Fausto Andres | 184.000000 | 73.0 | L | L | 2001-09-10 | 2006-04-13 | abada001 | abadan01 |
| 5 | abadfe01 | 1985.0 | 12.0 | 17.0 | D.R. | La Romana | La Romana | | | | ... | Abad | Fernando Antonio | 220.000000 | 73.0 | L | L | 2010-07-28 | 2016-09-25 | abadf001 | abadfe01 |
| 6 | abadijo01 | 1850.0 | 11.0 | 4.0 | USA | PA | Philadelphia | 1905.0 | 5.0 | 17.0 | ... | Abadie | John W. | 192.000000 | 72.0 | R | R | 1875-04-26 | 1875-06-10 | abadj101 | abadijo01 |
| 7 | abbated01 | 1877.0 | 4.0 | 15.0 | USA | PA | Latrobe | 1957.0 | 1.0 | 6.0 | ... | Abbaticchio | Edward James | 170.000000 | 71.0 | R | R | 1897-09-04 | 1910-09-15 | abbae101 | abbated01 |
| 8 | abbeybe01 | 1869.0 | 11.0 | 11.0 | USA | VT | Essex | 1962.0 | 6.0 | 11.0 | ... | Abbey | Bert Wood | 175.000000 | 71.0 | R | R | 1892-06-14 | 1896-09-23 | abbeb101 | abbeybe01 |
| 9 | abbeych01 | 1866.0 | 10.0 | 14.0 | USA | NE | Falls City | 1926.0 | 4.0 | 27.0 | ... | Abbey | Charles S. | 169.000000 | 68.0 | L | L | 1893-08-16 | 1897-08-19 | abbec101 | abbeych01 |
