# Data Science Lesson - Getting Data
---
### Reading from CSV files
A comma-separated values (CSV) file is a very common, generic file format used for data storage and transfer.
There is the "vanilla" Python way to get data out of a CSV, and there is the pandas way.
See https://realpython.com/python-csv/ to determine which you prefer. :)
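
For a quick feel of the "vanilla" route, here is a minimal sketch using the standard library's csv module. The three-line sample text is hypothetical and only stands in for a real file such as people.csv. Note that every value comes back as a string, so numbers have to be converted by hand, which is one of the chores pandas takes care of.

```python
import csv
import io

# Hypothetical sample standing in for a real CSV file on disk
sample = io.StringIO(
    "Name,Height,GPA\n"
    "Dakota,72,3.15\n"
    "Hayden,68,3.5\n"
)

# DictReader maps each row to the header names; every value is a string
rows = list(csv.DictReader(sample))

# Numbers must be converted explicitly before doing any math
heights = [int(row["Height"]) for row in rows]

print(rows[0]["Name"])              # Dakota
print(sum(heights) / len(heights))  # 70.0
```

With a real file, `open('people.csv', newline='')` would take the place of the `io.StringIO` object.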

Open your Google Sheets file of people data from the last lesson. Use File > Download to get a CSV locally and place it in the same directory as this notebook. Rename it "people.csv".

Now, import pandas as pd and use pd.read_csv() to read the contents of people.csv into a pandas dataframe. Output the dataframe to see what it looks like.

In [1]:
import pandas as pd
people = pd.read_csv('people.csv')
people

|  | Name | Height | GPA | Friends | Sport | Shoes |
|---|---|---|---|---|---|---|
| 0 | Dakota | 72 | 3.15 | 307 | basketball | sneakers |
| 1 | Hayden | 68 | 3.5 | 335 | tennis | flip flops |
| 2 | Charlie | 61 | 1.1 | 34 | baseball | flip flops |
| 3 | Kamryn | 66 | 2.18 | 200 | soccer | sneakers |
| 4 | Emerson | 65 | 3.06 | 213 | soccer | sneakers |
| 5 | Jessie | 61 | 2.41 | 202 | basketball | flip flops |
| 6 | Sawyer | 67 | 2.96 | 314 | tennis | flip flops |
| 7 | London | 64 | 3.98 | 436 | soccer | sneakers |


**Question:** What's a DataFrame?

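**Answer:** It's our new best friend! A DataFrame is the pandas table structure: named columns, each with its own data type, plus an index that labels the rows.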
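
To make that a bit more concrete, here is a small sketch that builds a DataFrame by hand instead of reading a file. The three rows are a hypothetical mini-table whose column names mirror people.csv; the calls shown are just a few common ways to poke at the result.

```python
import pandas as pd

# Hypothetical mini-table; column names mirror people.csv
df = pd.DataFrame({
    "Name": ["Dakota", "Hayden", "Charlie"],
    "Height": [72, 68, 61],
    "GPA": [3.15, 3.5, 1.1],
})

print(df.shape)            # (number of rows, number of columns)
print(list(df.columns))    # the column labels
print(df["Height"].max())  # one column on its own is a pandas Series
print(df.loc[1, "Name"])   # look up a cell by row label and column label
```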

Now that you have data in a pandas dataframe, use .info() to get details about the columns.

In [2]:
people.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Name     8 non-null      object 
 1   Height   8 non-null      int64  
 2   GPA      8 non-null      float64
 3   Friends  8 non-null      int64  
 4   Sport    8 non-null      object 
 5   Shoes    8 non-null      object 
dtypes: float64(1), int64(2), object(3)
memory usage: 512.0+ bytes


And now use .describe() to get summary statistics for the numeric columns.

In [3]:
people.describe()

|  | Height | GPA | Friends |
|---|---|---|---|
| count | 8.0 | 8.0 | 8.0 |
| mean | 65.5 | 2.7925 | 255.125 |
| std | 3.664502 | 0.888349 | 120.58481 |
| min | 61.0 | 1.1 | 34.0 |
| 25% | 63.25 | 2.3525 | 201.5 |
| 50% | 65.5 | 3.01 | 260.0 |
| 75% | 67.25 | 3.2375 | 319.25 |
| max | 72.0 | 3.98 | 436.0 |
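
The output above covers only Height, GPA, and Friends because, by default, .describe() summarizes numeric columns and skips text columns such as Sport. As a sketch, passing include="all" asks for every column; text columns then report unique, top, and freq instead of a mean. The four-row dataframe below is a hypothetical cut of the people data, used only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical four-row cut of the people data
df = pd.DataFrame({
    "Height": [72, 66, 65, 64],
    "Sport": ["basketball", "soccer", "soccer", "soccer"],
})

# Default behavior: only the numeric column is summarized
numeric_summary = df.describe()
print(list(numeric_summary.columns))

# include="all" adds text columns, with unique/top/freq rows for them
full_summary = df.describe(include="all")
print(full_summary.loc[["unique", "top", "freq"], "Sport"])
```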


## Weather Data
From https://github.com/fivethirtyeight/data/tree/master/us-weather-history we can get weather data as CSV files from many different airports.

Download a CSV file from the above site. (Make sure to pick one that no one else chooses.) You'll need to view the "raw" page and save the file locally as a .csv (not .txt).

Then, read the file into a dataframe and output it to verify.

In [4]:
weather = pd.read_csv('KPHX.csv')
weather

|  | date | actual_mean_temp | actual_min_temp | actual_max_temp | average_min_temp | average_max_temp | record_min_temp | record_max_temp | record_min_temp_year | record_max_temp_year | actual_precipitation | average_precipitation | record_precipitation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-7-1 | 98 | 86 | 109 | 82 | 107 | 65 | 115 | 1927 | 1990 | 0.00 | 0.02 | 2.68 |
| 1 | 2014-7-2 | 98 | 86 | 109 | 82 | 107 | 65 | 118 | 1911 | 2011 | 0.00 | 0.01 | 2.81 |
| 2 | 2014-7-3 | 94 | 79 | 108 | 82 | 107 | 64 | 117 | 1916 | 1907 | 0.00 | 0.02 | 0.22 |
| 3 | 2014-7-4 | 90 | 81 | 98 | 83 | 107 | 63 | 118 | 1912 | 1989 | 0.00 | 0.02 | 0.22 |
| 4 | 2014-7-5 | 94 | 84 | 103 | 83 | 107 | 63 | 116 | 1912 | 1983 | 0.01 | 0.02 | 0.18 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 360 | 2015-6-26 | 98 | 89 | 107 | 81 | 106 | 62 | 122 | 1963 | 1990 | 0.00 | 0.00 | 0.07 |
| 361 | 2015-6-27 | 99 | 90 | 108 | 81 | 106 | 55 | 118 | 1965 | 1990 | 0.01 | 0.00 | 0.04 |
| 362 | 2015-6-28 | 99 | 87 | 110 | 81 | 106 | 59 | 118 | 1965 | 1990 | 0.00 | 0.01 | 0.26 |
| 363 | 2015-6-29 | 98 | 86 | 110 | 81 | 107 | 59 | 119 | 1913 | 2013 | 0.05 | 0.00 | 0.09 |


Get the info for the data set.

In [5]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   date                   365 non-null    object 
 1   actual_mean_temp       365 non-null    int64  
 2   actual_min_temp        365 non-null    int64  
 3   actual_max_temp        365 non-null    int64  
 4   average_min_temp       365 non-null    int64  
 5   average_max_temp       365 non-null    int64  
 6   record_min_temp        365 non-null    int64  
 7   record_max_temp        365 non-null    int64  
 8   record_min_temp_year   365 non-null    int64  
 9   record_max_temp_year   365 non-null    int64  
 10  actual_precipitation   365 non-null    float64
 11  average_precipitation  365 non-null    float64
 12  record_precipitation   365 non-null    float64
dtypes: float64(3), int64(9), object(1)
memory usage: 37.2+ KB


And now describe the data.

In [6]:
weather.describe()

|  | actual_mean_temp | actual_min_temp | actual_max_temp | average_min_temp | average_max_temp | record_min_temp | record_max_temp | record_min_temp_year | record_max_temp_year | actual_precipitation | average_precipitation | record_precipitation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 |
| mean | 77.320548 | 65.761644 | 88.389041 | 63.50137 | 86.717808 | 43.016438 | 100.010959 | 1934.356164 | 1976.279452 | 0.027726 | 0.022 | 0.770466 |
| std | 14.055375 | 14.095072 | 14.506228 | 14.070588 | 14.704034 | 14.946783 | 12.650626 | 24.477041 | 29.460417 | 0.205119 | 0.012649 | 0.537395 |
| min | 41.0 | 31.0 | 46.0 | 44.0 | 65.0 | 16.0 | 77.0 | 1895.0 | 1896.0 | 0.0 | 0.0 | 0.01 |
| 25% | 66.0 | 54.0 | 77.0 | 50.0 | 72.0 | 30.0 | 88.0 | 1913.0 | 1956.0 | 0.0 | 0.01 | 0.36 |
| 50% | 77.0 | 65.0 | 89.0 | 62.0 | 87.0 | 40.0 | 102.0 | 1929.0 | 1985.0 | 0.0 | 0.02 | 0.71 |
| 75% | 90.0 | 79.0 | 101.0 | 78.0 | 102.0 | 56.0 | 112.0 | 1961.0 | 2000.0 | 0.0 | 0.03 | 1.05 |
| max | 105.0 | 94.0 | 116.0 | 84.0 | 107.0 | 70.0 | 122.0 | 1990.0 | 2015.0 | 3.29 | 0.05 | 3.29 |


Take a look at the mean and std of actual_mean_temp. The mean is the airport's average daily temperature over the whole year, and std is the standard deviation, which measures how much the daily values vary around that mean. Compare your numbers with others' to see whose airports are more temperate and whose are more volatile. Then look up each airport by its code and see if your observations make sense.
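
If several weather dataframes are loaded in the same notebook, the comparison can be lined up in one small table. The sketch below assumes two made-up airports, "AAAA" and "BBBB", with invented readings; in practice each entry would be a dataframe read from a downloaded CSV. The two airports were chosen to have the same mean but very different spreads, to show why the std matters.

```python
import pandas as pd

# Hypothetical readings for two made-up airport codes; real data would
# come from pd.read_csv on each downloaded weather file
frames = {
    "AAAA": pd.DataFrame({"actual_mean_temp": [60, 62, 61, 63]}),
    "BBBB": pd.DataFrame({"actual_mean_temp": [30, 90, 45, 81]}),
}

# One row per airport, with the mean and std of its daily mean temperature
comparison = pd.DataFrame({
    code: df["actual_mean_temp"].agg(["mean", "std"])
    for code, df in frames.items()
}).T

print(comparison.sort_values("std"))

# The airport whose temperatures swing the most over the period
most_volatile = comparison["std"].idxmax()
print(most_volatile)
```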

## Reading directly from the web
We can also get CSV data directly from the web without saving the file locally. Note that this creates a dependency on the host of the data. If that resource is moved (or removed), our script will stop functioning.
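
One way to soften that dependency is to keep a local copy the first time the download succeeds. The helper below is a minimal sketch, not a pandas feature: the name read_csv_cached and its cache_path parameter are made up for this example.

```python
from pathlib import Path

import pandas as pd


def read_csv_cached(url, cache_path):
    """Read cache_path if it exists; otherwise read url and save a copy.

    Hypothetical helper for illustration only.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # A local copy is already there, so the web host is not needed
        return pd.read_csv(cache_path)
    df = pd.read_csv(url)               # needs the remote resource to be reachable
    df.to_csv(cache_path, index=False)  # keep a copy for next time
    return df
```

A call might look like `read_csv_cached(some_url, 'local_copy.csv')`, where both arguments are placeholders. The trade-off is that the cached copy will not pick up later changes to the remote file.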

Try getting data on surnames from here: https://raw.githubusercontent.com/fivethirtyeight/data/master/most-common-name/surnames.csv

Then, investigate the data using the pandas tools we just learned.

In [12]:
surnames = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/most-common-name/surnames.csv')
surnames

|  | name | rank | count | prop100k | cum_prop100k | pctwhite | pctblack | pctapi | pctaian | pct2prace | pcthispanic |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SMITH | 1 | 2376206 | 880.85 | 880.85 | 73.35 | 22.22 | 0.4 | 0.85 | 1.63 | 1.56 |
| 1 | JOHNSON | 2 | 1857160 | 688.44 | 1569.30 | 61.55 | 33.8 | 0.42 | 0.91 | 1.82 | 1.5 |
| 2 | WILLIAMS | 3 | 1534042 | 568.66 | 2137.96 | 48.52 | 46.72 | 0.37 | 0.78 | 2.01 | 1.6 |
| 3 | BROWN | 4 | 1380145 | 511.62 | 2649.58 | 60.71 | 34.54 | 0.41 | 0.83 | 1.86 | 1.64 |
| 4 | JONES | 5 | 1362755 | 505.17 | 3154.75 | 57.69 | 37.73 | 0.35 | 0.94 | 1.85 | 1.44 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 151666 | YOUSKO | 150436 | 100 | 0.04 | 89752.93 | 99 | (S) | 0 | 0 | 0 | (S) |
| 151667 | ZAITSEV | 150436 | 100 | 0.04 | 89753.04 | 92 | (S) | 0 | 0 | 7 | (S) |
| 151668 | ZALLA | 150436 | 100 | 0.04 | 89753.11 | 99 | (S) | 0 | 0 | 0 | (S) |
| 151669 | ZERBEY | 150436 | 100 | 0.04 | 89753.30 | 99 | (S) | 0 | 0 | 0 | (S) |


In [13]:
surnames.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151671 entries, 0 to 151670
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   name          151670 non-null  object 
 1   rank          151671 non-null  int64  
 2   count         151671 non-null  int64  
 3   prop100k      151671 non-null  float64
 4   cum_prop100k  151671 non-null  float64
 5   pctwhite      151671 non-null  object 
 6   pctblack      151671 non-null  object 
 7   pctapi        151671 non-null  object 
 8   pctaian       151671 non-null  object 
 9   pct2prace     151671 non-null  object 
 10  pcthispanic   151671 non-null  object 
dtypes: float64(2), int64(2), object(7)
memory usage: 12.7+ MB


In [14]:
surnames.describe()

|  | rank | count | prop100k | cum_prop100k |
|---|---|---|---|---|
| count | 151671.0 | 151671.0 | 151671.0 | 151671.0 |
| mean | 75649.497781 | 1596.357 | 0.591744 | 82520.575351 |
| std | 43614.414271 | 16338.75 | 6.056723 | 8902.405422 |
| min | 1.0 | 100.0 | 0.04 | 880.85 |
| 25% | 37881.0 | 143.0 | 0.05 | 80519.25 |
| 50% | 75695.0 | 237.0 | 0.09 | 85509.67 |
| 75% | 113519.0 | 551.0 | 0.2 | 88079.475 |
| max | 150436.0 | 2376206.0 | 880.85 | 89753.56 |
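
Notice that .describe() skipped the pct columns, and .info() lists them as object rather than float64. A likely reason is visible in the dataframe output: some cells hold the text (S), and a single non-numeric entry is enough for pandas to read the whole column as text. The sketch below uses a hypothetical three-row sample in the style of surnames.csv and shows one possible way to get numbers back, turning anything unparseable into a missing value (NaN). Whether that is the right treatment for (S) depends on what the data source means by it.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the style of surnames.csv
sample = io.StringIO(
    "name,pctwhite,pctblack\n"
    "SMITH,73.35,22.22\n"
    "YOUSKO,99,(S)\n"
    "ZALLA,99,(S)\n"
)
df = pd.read_csv(sample)

# pctwhite parses as numbers; pctblack contains "(S)", so it is read as text
print(df.dtypes)

# errors="coerce" converts what it can and replaces the rest with NaN
df["pctblack_num"] = pd.to_numeric(df["pctblack"], errors="coerce")
print(df["pctblack_num"].isna().sum())  # how many cells could not be converted
```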
