# Pandas and numpy - pair-up
### Discussion session

1. How would you read the following data into a pandas DataFrame?
 `ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt`

In [106]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [107]:
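# file = pd.read_csv('ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt')
# The file was downloaded into this directory instead.
# skiprows=72 skips the header block at the top of the downloaded copy.
file = pd.read_csv('co2_mm_mlo.txt', skiprows=72,
                   names=['year', 'month', 'decimal date', 'average', 'interpolated', 'trend', '#days'],
                   sep=r'\s+')  # whitespace-delimited; delim_whitespace=True is deprecated since pandas 2.2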
file.head(5)

Unnamed: 0,year,month,decimal date,average,interpolated,trend,#days
0,1958,3,1958.208,315.71,315.71,314.62,-1
1,1958,4,1958.292,317.45,317.45,315.29,-1
2,1958,5,1958.375,317.5,317.5,314.71,-1
3,1958,6,1958.458,-99.99,317.1,314.85,-1
4,1958,7,1958.542,315.86,315.86,314.98,-1
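As a self-contained illustration of the same parsing options, the sketch below reads a few rows from an in-memory string instead of the downloaded file. The data rows are copied from the output above; the two header lines are made up for the example, and the sketch assumes every header line starts with `#`, so that `comment='#'` can replace a hard-coded `skiprows` count.

```python
import io

import pandas as pd

# Data rows copied from the output above; the two '#' lines are an
# illustrative stand-in for the real file's header, not its actual text.
sample_text = """\
# illustrative header line
# year month decimal average interpolated trend days
1958 3 1958.208 315.71 315.71 314.62 -1
1958 4 1958.292 317.45 317.45 315.29 -1
1958 6 1958.458 -99.99 317.10 314.85 -1
"""

column_names = ['year', 'month', 'decimal date', 'average',
                'interpolated', 'trend', '#days']

sample = pd.read_csv(
    io.StringIO(sample_text),
    sep=r'\s+',          # split fields on runs of whitespace
    comment='#',         # ignore lines that start with '#'
    names=column_names,  # the file has no usable header row
)
print(sample)
```

If the header length of the real file changes between downloads, skipping by comment character is less fragile than counting rows, provided the assumption about `#` holds.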


2. How would you pick columns 0, 1 and 3?
`[[0, 1, 3]]`

In [108]:
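# By name: file[['year', 'month', 'average']]
column_names = file.columns
file[[column_names[0], column_names[1], column_names[3]]].head()

# By position with iloc[rows, columns]:
# file.iloc[:, [0, 1, 3]].head()

# file[[0, 1, 3]] looks up column labels, not positions. It only works when
# the columns have no names (labels 0, 1, 2, ...); here it raises a KeyError.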


Unnamed: 0,year,month,average
0,1958,3,315.71
1,1958,4,317.45
2,1958,5,317.5
3,1958,6,-99.99
4,1958,7,315.86
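The difference between label-based and position-based selection can be sketched on a small frame. The frame below is a toy stand-in built from the first two rows shown above, not the full data set.

```python
import pandas as pd

# Toy frame with the same leading columns as the CO2 data.
df = pd.DataFrame({
    'year': [1958, 1958],
    'month': [3, 4],
    'decimal date': [1958.208, 1958.292],
    'average': [315.71, 317.45],
})

by_position = df.iloc[:, [0, 1, 3]]            # positions 0, 1 and 3
by_name = df[['year', 'month', 'average']]     # the same columns by label
same = by_position.equals(by_name)

# Plain [] indexing treats the list as labels, so integers fail here.
try:
    df[[0, 1, 3]]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

print(same, label_lookup_failed)
```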


3. Use a for loop to find all rows where the CO2 entry (column 3) has the value -99.99 (these are missing values) and replace them with NaN (try using np.nan; do you know what it is?)

In [109]:
# Assign through .loc[row, column] so the write goes to the DataFrame itself.
# The chained form file['average'][index] = np.nan raises SettingWithCopyWarning
# and may end up modifying a temporary copy.
for index, entry in enumerate(file['average']):
    if entry == -99.99:
        file.loc[index, 'average'] = np.nan
file.head()

# Restart your kernel before rerunning! The loop below does not work:
# it overwrites the whole column with NaN at the first match.
# for entry in file['average']:
#     if entry == -99.99:
#         file['average']  = np.nan
# file.head()

# If there are no column names, use the index of the column.


Unnamed: 0,year,month,decimal date,average,interpolated,trend,#days
0,1958,3,1958.208,315.71,315.71,314.62,-1
1,1958,4,1958.292,317.45,317.45,315.29,-1
2,1958,5,1958.375,317.5,317.5,314.71,-1
3,1958,6,1958.458,,317.1,314.85,-1
4,1958,7,1958.542,315.86,315.86,314.98,-1
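For comparison with the loop, here is a minimal sketch of the same replacement done with a boolean mask and a single `.loc` assignment. The three values are taken from the `average` column above; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Three 'average' values from the table above, including one sentinel.
df = pd.DataFrame({'average': [315.71, -99.99, 315.86]})

# Boolean mask marking the rows that hold the missing-value sentinel.
mask = df['average'] == -99.99

# One indexing operation: rows selected by the mask, column selected by name.
df.loc[mask, 'average'] = np.nan

n_missing = int(df['average'].isna().sum())
print(df)
```

Because rows and column are selected in one `.loc` call, pandas writes to the original frame rather than to an intermediate copy.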


In [110]:
file2 = file.drop(columns=['interpolated'])
file2.head(5)


Unnamed: 0,year,month,decimal date,average,trend,#days
0,1958,3,1958.208,315.71,314.62,-1
1,1958,4,1958.292,317.45,315.29,-1
2,1958,5,1958.375,317.5,314.71,-1
3,1958,6,1958.458,,314.85,-1
4,1958,7,1958.542,315.86,314.98,-1


4. Change the column names to year, month, and CO2 (use colnames).

In [111]:
file2.rename(columns = {'average':'CO2'}, inplace = True)
file2.head(5)

Unnamed: 0,year,month,decimal date,CO2,trend,#days
0,1958,3,1958.208,315.71,314.62,-1
1,1958,4,1958.292,317.45,315.29,-1
2,1958,5,1958.375,317.5,314.71,-1
3,1958,6,1958.458,,314.85,-1
4,1958,7,1958.542,315.86,314.98,-1
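Two common ways to rename columns are sketched below on a one-row toy frame: `rename` with a mapping, which changes only the listed labels, and assigning a full list to `.columns`, which replaces every label at once.

```python
import pandas as pd

df = pd.DataFrame({'year': [1958], 'month': [3], 'average': [315.71]})

# rename returns a new frame and leaves df untouched unless inplace=True.
renamed = df.rename(columns={'average': 'CO2'})
labels_after_rename = list(df.columns)

# Assigning to .columns replaces all labels; the list length must match.
df.columns = ['year', 'month', 'CO2']

print(list(renamed.columns), list(df.columns))
```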


5. Add a column 'day' and specify day 15 for all entries

In [112]:
file2['day'] = 15 

In [113]:
file2.head()

Unnamed: 0,year,month,decimal date,CO2,trend,#days,day
0,1958,3,1958.208,315.71,314.62,-1,15
1,1958,4,1958.292,317.45,315.29,-1,15
2,1958,5,1958.375,317.5,314.71,-1,15
3,1958,6,1958.458,,314.85,-1,15
4,1958,7,1958.542,315.86,314.98,-1,15


In [114]:
file3 = file2.drop(columns = ['#days'])
file3.head(5)

Unnamed: 0,year,month,decimal date,CO2,trend,day
0,1958,3,1958.208,315.71,314.62,15
1,1958,4,1958.292,317.45,315.29,15
2,1958,5,1958.375,317.5,314.71,15
3,1958,6,1958.458,,314.85,15
4,1958,7,1958.542,315.86,314.98,15


6. Add a date column built from the 'year', 'month' and 'day' columns (options: use apply with a lambda, or a for loop together with datetime.date; make sure to import it)

In [115]:
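# pd.to_datetime assembles dates directly from columns named year, month and day,
# so no format string or datetime import is needed for this approach.
file3['date'] = pd.to_datetime(file3[['year', 'month', 'day']])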
file3.head(3)


Unnamed: 0,year,month,decimal date,CO2,trend,day,date
0,1958,3,1958.208,315.71,314.62,15,1958-03-15
1,1958,4,1958.292,317.45,315.29,15,1958-04-15
2,1958,5,1958.375,317.5,314.71,15,1958-05-15
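The option suggested in the question, `apply` with a lambda and `datetime.date`, might look like the sketch below. It uses a two-row toy frame and, for comparison, also builds the column by passing the three columns to `pd.to_datetime`.

```python
import datetime

import pandas as pd

df = pd.DataFrame({'year': [1958, 1958], 'month': [3, 4], 'day': [15, 15]})

# Row-wise: build a datetime.date from each row (axis=1 passes rows).
# int() converts the NumPy integers to plain Python ints.
df['date'] = df.apply(
    lambda row: datetime.date(int(row['year']), int(row['month']), int(row['day'])),
    axis=1,
)

# Column-wise alternative: pandas assembles Timestamps from the three columns.
assembled = pd.to_datetime(df[['year', 'month', 'day']])

print(df['date'].tolist())
print(assembled.tolist())
```

The `apply` version stores plain `datetime.date` objects (dtype `object`), while `pd.to_datetime` yields a `datetime64` column, which is usually more convenient for later time-based operations.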


7. Drop the 'day' column

In [116]:
file4 = file3.drop(columns = ['day'])
file4.head(3)


Unnamed: 0,year,month,decimal date,CO2,trend,date
0,1958,3,1958.208,315.71,314.62,1958-03-15
1,1958,4,1958.292,317.45,315.29,1958-04-15
2,1958,5,1958.375,317.5,314.71,1958-05-15


8. Use pandas groupby to print the yearly average of CO2.

In [117]:
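# groupby returns a new object and does not change file4. A notebook cell only
# displays its last expression, so keep the yearly means in a variable
# (inspect them with print(yearly_co2)).
yearly_co2 = file4.groupby('year')['CO2'].mean()
file4.head(2)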

Unnamed: 0,year,month,decimal date,CO2,trend,date
0,1958,3,1958.208,315.71,314.62,1958-03-15
1,1958,4,1958.292,317.45,315.29,1958-04-15
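A small sketch of the yearly average on toy data follows. The 1958 values come from the table above; the 1959 values are made up for illustration.

```python
import numpy as np
import pandas as pd

# 1958 values from the table above; 1959 values are illustrative only.
df = pd.DataFrame({
    'year': [1958, 1958, 1958, 1959, 1959],
    'CO2': [315.71, 317.45, np.nan, 315.62, 316.38],
})

# Group rows by year, pick the CO2 column, and average each group.
# mean() skips NaN by default, so the missing month does not poison 1958.
yearly_mean = df.groupby('year')['CO2'].mean()
print(yearly_mean)
```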


9. Pick columns that you think could be used to build a model and store them in a NumPy array. (Why do we do that?)

In [118]:
file4.columns

Index(['year', 'month', 'decimal date', 'CO2', 'trend', 'date'], dtype='object')

In [119]:

# See how CO2 changes over time: keep the time and CO2 columns as a NumPy array.
co2_over_year = file4[['decimal date', 'CO2']].to_numpy()
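One possible answer to the "why" is that numerical and modelling libraries generally operate on plain arrays with a single dtype and a known shape. The sketch below, on three rows from the table above, splits a feature matrix `X` and a target vector `y`; these names follow a common convention and are not from the exercise.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'decimal date': [1958.208, 1958.292, 1958.375],
    'CO2': [315.71, 317.45, 317.50],
})

# Double brackets keep a 2-D shape (n_samples, n_features) for the features.
X = df[['decimal date']].to_numpy()

# Single brackets give a 1-D array (n_samples,) for the target.
y = df['CO2'].to_numpy()

print(X.shape, y.shape, X.dtype)
```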

10. Repeat step (3), but this time using np.where.

In [126]:
file_npwhere = np.where(file.iloc[:,3]==-99.99, np.nan,file.iloc[:,3])
# file_npwhere
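`np.where` has two forms, both sketched below on a short array of values from the `average` column. With three arguments it picks elementwise between two alternatives; with only a condition it returns the positions where the condition holds.

```python
import numpy as np

co2 = np.array([315.71, 317.45, -99.99, 315.86])

# Three-argument form: NaN where the condition is true, the original value elsewhere.
cleaned = np.where(co2 == -99.99, np.nan, co2)

# One-argument form: a tuple of index arrays; [0] gives the positions along axis 0.
missing_positions = np.where(co2 == -99.99)[0]

print(cleaned)
print(missing_positions)
```

Note that the three-argument form returns a new array, so the result has to be assigned back to the DataFrame column to take effect there.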


11. Download the notebook as a .py script and run it from your terminal.

12. Create a branch in your GitHub repository called warm_up_draft.

13. Push the notebook with the name CO2 to your new branch on GitHub.

*** Optional 

Open the notebook from yesterday and create a program that randomly selects pairings of people. (Hint: use numpy.random.choice)
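One way to approach the pairing exercise is to draw all names without replacement, which amounts to a shuffle, and then group neighbours. The sketch below uses `Generator.choice` from `np.random.default_rng`, the newer counterpart of `numpy.random.choice`; the function name `make_pairs` and the example names are hypothetical.

```python
import numpy as np


def make_pairs(names, seed=None):
    """Randomly split names into pairs; an odd name out forms its own group."""
    rng = np.random.default_rng(seed)
    # Drawing len(names) items without replacement yields a random permutation.
    shuffled = [str(n) for n in rng.choice(names, size=len(names), replace=False)]
    # Walk the shuffled list two at a time.
    return [tuple(shuffled[i:i + 2]) for i in range(0, len(shuffled), 2)]


# Hypothetical participants, for illustration only.
pairs = make_pairs(['Ana', 'Ben', 'Cleo', 'Dev'], seed=0)
print(pairs)
```

Passing a `seed` makes the pairing reproducible; leaving it as `None` gives a different pairing on each run.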