# Pandas

While NumPy can be used to import data, it is optimized around numerical data. Many data sets include categorical variables. For these data sets, it is best to use a library called `pandas`, which focuses on creating and manipulating data frames. 

In [1]:
import pandas as pd

### Read data
With `pandas` imported, we can read in .csv files with the `pandas` function `read_csv()`.

In that function, we can specify the file we want to use with a URL or with the path to a local file as a string.

This saves the data in a structure called a DataFrame.

We are going to be using data on [long term average precipitation and temperature values in Boston from ~1980s-2010 from NOAA](https://www.ncei.noaa.gov/data/normals-monthly/doc/NORMAL_MLY_documentation.pdf).

In [3]:
filename = "C:/Users/delre/python/data/boston_precip_temp.csv"
# If you paste a Windows path, it will use backslashes (\), which Python treats as escape
# characters inside a string. Change them to forward slashes (/) and the path will work.

# filename = "https://raw.githubusercontent.com/ENVS110a-SP23/python/main/data/boston_precip_temp.csv"

df = pd.read_csv(filename)



Our data is now saved as a data frame in Python as the variable `df`. With the data now in the environment, we can take a look at the first few rows with `df.head()`.

In [5]:
df.head()

Unnamed: 0,station,name,date,temp,diurnal_temp_range,precip-total,snow-totals
0,USW00054704,"NORWOOD MEMORIAL AIRPORT, MA US",1,25.9,19.7,3.43,
1,USW00054704,"NORWOOD MEMORIAL AIRPORT, MA US",2,28.9,21.0,3.25,
2,USW00054704,"NORWOOD MEMORIAL AIRPORT, MA US",3,36.4,21.5,4.45,
3,USW00054704,"NORWOOD MEMORIAL AIRPORT, MA US",4,46.8,22.7,4.19,
4,USW00054704,"NORWOOD MEMORIAL AIRPORT, MA US",5,56.4,24.9,3.68,


We can see that this data frame has several different columns, with information about stations, precipitation and temperature.

If you have an Excel file, you can use `pd.read_excel()` instead. By default it reads the first sheet. With the `sheet_name` argument you can ask for a single sheet, by name or by position, or for a list of sheets, which gives you a dictionary of pandas DataFrames.

If you pass `sheet_name=None`, you get all of the sheets back, also as a dictionary.

In [10]:
# xlsx = data_dir + "boston_precip_temp.xlsx"  # local copy; requires data_dir to be defined first

xlsx = "https://raw.githubusercontent.com/ENVS110a-SP23/python/main/data/boston_precip_temp.xlsx"
pd.read_excel(xlsx)



Unnamed: 0,station,name,date,temp,diurnal_temp_range,precip-total,snow-totals
0,USW00054704,"NORWOOD MEMORIAL AIRPORT, MA US",1,25.9,19.7,3.43,
1,USW00054704,"NORWOOD MEMORIAL AIRPORT, MA US",2,28.9,21.0,3.25,
2,USW00054704,"NORWOOD MEMORIAL AIRPORT, MA US",3,36.4,21.5,4.45,
3,USW00054704,"NORWOOD MEMORIAL AIRPORT, MA US",4,46.8,22.7,4.19,
4,USW00054704,"NORWOOD MEMORIAL AIRPORT, MA US",5,56.4,24.9,3.68,
...,...,...,...,...,...,...,...
151,USC00194760,"MILFORD, MA US",8,,,4.03,0.0
152,USC00194760,"MILFORD, MA US",9,,,3.94,0.0
153,USC00194760,"MILFORD, MA US",10,,,4.57,0.0
154,USC00194760,"MILFORD, MA US",11,,,4.47,0.9


In [11]:
# pd.read_excel(xlsx)

pd.read_excel(xlsx, sheet_name=0)  # first sheet (positions start at 0)
pd.read_excel(xlsx, sheet_name=1)  # second sheet; only this last result is displayed

Unnamed: 0,station,name,date,temp,diurnal_temp_range,precip-total,snow-totals
0,USC00190860,"BROCKTON, MA US",1,27.0,19.3,3.75,
1,USC00190860,"BROCKTON, MA US",2,29.6,19.6,3.59,
2,USC00190860,"BROCKTON, MA US",3,36.7,20.0,5.18,
3,USC00190860,"BROCKTON, MA US",4,46.3,21.6,4.63,
4,USC00190860,"BROCKTON, MA US",5,56.2,23.1,3.58,
...,...,...,...,...,...,...,...
139,USC00195984,"NORTON, MA US",8,69.9,23.1,4.04,0.0
140,USC00195984,"NORTON, MA US",9,61.8,23.9,3.99,0.0
141,USC00195984,"NORTON, MA US",10,50.6,22.8,4.39,-7777.0
142,USC00195984,"NORTON, MA US",11,42.0,20.8,4.79,1.5


In [3]:
pd.read_excel(xlsx, sheet_name=['Sheet1','Sheet2'])

In [None]:
pd.read_excel(xlsx)

## Making sure data is in correct form

When a file does not follow the standard layout, reading it can cause problems. This tends to happen when the first line of the .csv file does not contain the column names.

For an example, we'll take a look at [a data set of two files on arctic vegetation plots](http://dx.doi.org/10.3334/ORNLDAAC/1358).

In [2]:
# environmental_data = data_dir + "Arrigetch_Peaks_Environmental_Data_raw.csv"  # requires data_dir to be defined first
environmental_data = "https://raw.githubusercontent.com/ENVS110a-SP23/python/main/data/Arrigetch_Peaks_Environmental_Data_raw.csv"
pd.read_csv(environmental_data)

In [17]:
# species_file = data_dir + "Arrigetch_Peaks_Species_Data_raw.csv"  # requires data_dir to be defined first
species_file = "/Users/fordfishman/GitHub/envs110/python/data/Arrigetch_Peaks_Species_Data_raw.csv"  # local path; change it to match your computer
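
When the column names are not on the first line, `read_csv()` can be told to skip the lines that come before them. The layout of the raw Arrigetch files is not shown here, so the sketch below uses a small made-up file held in a string instead. The contents and the column names `plot` and `elevation_m` are assumptions for illustration only.

```python
import io
import pandas as pd

# Made-up file contents: two lines of notes come before the real header row
raw_text = (
    "Example plot data (made up for illustration)\n"
    "Values below are not real measurements\n"
    "plot,elevation_m\n"
    "1,900\n"
    "2,1150\n"
)

# skiprows=2 tells read_csv to ignore the first two lines,
# so the third line is used as the column names
plots = pd.read_csv(io.StringIO(raw_text), skiprows=2)
print(plots)
```

For a real file, the number of lines to skip depends on how many lines come before the header, so it is worth opening the file in a text editor first.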


### Question

For in-class questions, we'll be working with a data set called Gapminder. It is in the `data` subdirectory in this repo as `gapminder.csv`. You can also find it at this stable url: `https://raw.githubusercontent.com/ENVS110a-SP23/python/main/data/gapminder.csv`.

Load this data set and display the first few rows with `.head()`. **Save it under a different variable name than `df` so that you don't overwrite the precipitation and temperature data frame.**

In [18]:
# your code here
df2 = pd.read_csv("https://raw.githubusercontent.com/ENVS110a-SP23/python/main/data/gapminder.csv")
df2.head()

## Summarize data frame

It is important to understand the data we are working with before we begin analysis. First, let's look at the dimensions of the data frame using `df.shape`. It gives the number of rows followed by the number of columns.

In [20]:
df.head()

df.shape  # an attribute of the data frame; returns a tuple with the number of rows
# followed by the number of columns

(300, 7)

This shows that our data frame has 300 rows by 7 columns.

We can get out those numbers individually through indexing.

In [21]:
nrows = df.shape[0]     # number of rows
ncolumns = df.shape[1]  # number of columns
ncolumns

7

`len(df)` also gets back how many rows you have.

In [22]:
len(df)

300

We can also use `df.columns` to display the column names.

In [23]:
df.columns

Index(['station', 'name', 'date', 'temp', 'diurnal_temp_range', 'precip-total',
       'snow-totals'],
      dtype='object')

## Renaming columns and rows

We can rename as many columns as we want with `df.rename(columns = {old_name:new_name,...})`.

Note that you need to re-assign to `df` or make a new variable if you want to save the renamed columns.

In [27]:
df.rename(columns={'diurnal_temp_range':'dtr'})  # returns a renamed copy, but does not save the change
df = df.rename(columns={'diurnal_temp_range':'dtr'})  # saves the change in Python, but not in the original file

We can also rename rows by using `index` instead of `columns`. This is less common, however.

In [28]:
df.rename(index={0:'first', 1:'second'})

Unnamed: 0,station,name,date,temp,dtr,precip-total,snow-totals
first,USW00054704,"NORWOOD MEMORIAL AIRPORT, MA US",1,25.9,19.7,3.43,
second,USW00054704,"NORWOOD MEMORIAL AIRPORT, MA US",2,28.9,21.0,3.25,
2,USW00054704,"NORWOOD MEMORIAL AIRPORT, MA US",3,36.4,21.5,4.45,
3,USW00054704,"NORWOOD MEMORIAL AIRPORT, MA US",4,46.8,22.7,4.19,
4,USW00054704,"NORWOOD MEMORIAL AIRPORT, MA US",5,56.4,24.9,3.68,
...,...,...,...,...,...,...,...
295,USC00195984,"NORTON, MA US",8,69.9,23.1,4.04,0.0
296,USC00195984,"NORTON, MA US",9,61.8,23.9,3.99,0.0
297,USC00195984,"NORTON, MA US",10,50.6,22.8,4.39,-7777.0
298,USC00195984,"NORTON, MA US",11,42.0,20.8,4.79,1.5


### Question 

Using the gapminder data frame, print out the column names. Rename the `age5_surviving` and `babies_per_woman` columns to be shorter.

In [29]:
# your code here: 

df2.rename(columns={'babies_per_woman': 'bpw', 'age5_surviving': 'a5s'})

Unnamed: 0,country,year,region,population,life_expectancy,a5s,bpw,gdp_per_capita,gdp_per_day
0,Afghanistan,1800,Asia,3280000.0,28.21,53.142,7.00,603.0,1.650924
1,Afghanistan,1810,Asia,3280000.0,28.11,53.002,7.00,604.0,1.653662
2,Afghanistan,1820,Asia,3323519.0,28.01,52.862,7.00,604.0,1.653662
3,Afghanistan,1830,Asia,3448982.0,27.90,52.719,7.00,625.0,1.711157
4,Afghanistan,1840,Asia,3625022.0,27.80,52.576,7.00,647.0,1.771389
...,...,...,...,...,...,...,...,...,...
14735,Zimbabwe,2011,Africa,14255592.0,51.60,90.800,3.64,1626.0,4.451745
14736,Zimbabwe,2012,Africa,14565482.0,54.20,91.330,3.56,1750.0,4.791239
14737,Zimbabwe,2013,Africa,14898092.0,55.70,91.670,3.49,1773.0,4.854209
14738,Zimbabwe,2014,Africa,15245855.0,57.00,91.900,3.41,1773.0,4.854209


### Categorical variables
Next, let's summarize the categorical, non-numerical variables. For instance, we can identify how many unique station names we have in the data set.

First, to select a column, we use the notation `df['COLUMN_NAME']`.

In [30]:
df['name']

0      NORWOOD MEMORIAL AIRPORT, MA US
1      NORWOOD MEMORIAL AIRPORT, MA US
2      NORWOOD MEMORIAL AIRPORT, MA US
3      NORWOOD MEMORIAL AIRPORT, MA US
4      NORWOOD MEMORIAL AIRPORT, MA US
                    ...               
295                      NORTON, MA US
296                      NORTON, MA US
297                      NORTON, MA US
298                      NORTON, MA US
299                      NORTON, MA US
Name: name, Length: 300, dtype: object

Depending on your column name, you can also refer to the column with `df.column_name`.

In [31]:
df.station  # this only works if the column name is a valid Python name (no spaces or hyphens)
# and does not clash with an existing data frame attribute; df['column_name'] works for any column

0      USW00054704
1      USW00054704
2      USW00054704
3      USW00054704
4      USW00054704
          ...     
295    USC00195984
296    USC00195984
297    USC00195984
298    USC00195984
299    USC00195984
Name: station, Length: 300, dtype: object
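
To make the difference between the two notations concrete, here is a small sketch using a made-up two-row data frame (the station codes `S1` and `S2` are placeholders, not real stations). It shows that both notations return the same column for a simple name, while a hyphenated name such as `precip-total` needs brackets.

```python
import pandas as pd

# Made-up two-row data frame with one hyphenated column name
weather = pd.DataFrame({
    "station": ["S1", "S2"],
    "precip-total": [3.43, 3.25],
})

# Both forms return the same column when the name is a valid Python name
same_column = weather.station.equals(weather["station"])
print(same_column)

# A hyphenated name only works with brackets;
# weather.precip-total would be read as "weather.precip minus total"
precip = weather["precip-total"]
print(precip)
```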

To identify unique entries in this column, we can use the `pd.unique()` function. 

In [32]:
pd.unique(df['name'])

array(['NORWOOD MEMORIAL AIRPORT, MA US', 'NATICK, MA US',
       'MAYNARD, MA US', 'READING, MA US', 'BLUE HILL LCD, MA US',
       'JAMAICA PLAIN, MA US', 'LAWRENCE, MA US',
       'SOUTH WEYMOUTH NAS, MA US', 'MARBLEHEAD, MA US',
       'MIDDLETON, MA US', 'BRIDGEWATER, MA US', 'GROVELAND, MA US',
       'MILFORD, MA US', 'BROCKTON, MA US',
       'BEVERLY MUNICIPAL AIRPORT, MA US', 'FRANKLIN, MA US',
       'HINGHAM, MA US', 'HAVERHILL, MA US', 'BOSTON, MA US',
       'BEDFORD HANSCOM FIELD, MA US', 'BEVERLY, MA US', 'LOWELL, MA US',
       'WALPOLE 2, MA US', 'LAWRENCE MUNICIPAL AIRPORT, MA US',
       'NORTON, MA US'], dtype=object)

We can wrap this in the `len()` function to count how many unique values we have.

In [33]:
len(pd.unique( df['name'] ))
#this shows that there are 25 unique names

25
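
As an alternative to combining `len()` and `pd.unique()`, a column also has its own methods for this kind of summary. The sketch below uses a short made-up list of names rather than the full data set. `.nunique()` counts the distinct values and `.value_counts()` counts how often each one appears.

```python
import pandas as pd

# Made-up column of names, with one name repeated three times
names = pd.Series(["NORTON", "NATICK", "NORTON", "MILFORD", "NORTON"])

n_unique = names.nunique()     # number of distinct values
counts = names.value_counts()  # how many times each value occurs

print(n_unique)
print(counts)
```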

### Numerical variables

Numerical columns can be summarized in several ways. Let's find the mean first.

To make things simpler, we'll just do calculations on the numerical columns of the precipitation and temperature data. We can put those names in a `list` and then specify that list for the columns, or we can let pandas pick out the numerical columns by their data type.

In [37]:
num_cols = ['date', 'temp', 'dtr', 'precip-total', 'snow-totals']  # numerical columns ('diurnal_temp_range' was renamed to 'dtr' above)

In [38]:
df.dtypes

station          object
name             object
date              int64
temp            float64
dtr             float64
precip-total    float64
snow-totals     float64
dtype: object

In [39]:
df.select_dtypes(include=['int64','float64'])

Unnamed: 0,date,temp,dtr,precip-total,snow-totals
0,1,25.9,19.7,3.43,
1,2,28.9,21.0,3.25,
2,3,36.4,21.5,4.45,
3,4,46.8,22.7,4.19,
4,5,56.4,24.9,3.68,
...,...,...,...,...,...
295,8,69.9,23.1,4.04,0.0
296,9,61.8,23.9,3.99,0.0
297,10,50.6,22.8,4.39,-7777.0
298,11,42.0,20.8,4.79,1.5


In [41]:
df_num = df.select_dtypes(exclude=['object'])

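With the numerical columns selected (here with `select_dtypes()`, which filters columns by data type), we can run `.mean()` to find the mean of each column. `.median()` and `.std()` work the same way.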

In [45]:
df_num.mean()    # the mean of each column
df_num.median()  # the median of each column
df_num.std()     # the standard deviation of each column; only this last result is displayed

date               3.457820
temp              15.701830
dtr                2.597781
precip-total       0.464102
snow-totals     1700.634258
dtype: float64

If we want a larger variety of summary statistics, we can use the `.describe()` method.

In [46]:
df_num.describe()

Unnamed: 0,date,temp,dtr,precip-total,snow-totals
count,300.0,276.0,276.0,300.0,180.0
mean,6.5,49.424638,20.128986,4.068467,-384.763333
std,3.45782,15.70183,2.597781,0.464102,1700.634258
min,1.0,23.6,13.0,2.79,-7777.0
25%,3.75,34.675,18.275,3.75,0.0
50%,6.5,49.2,20.2,4.03,0.6
75%,9.25,64.675,21.825,4.39,9.2
max,12.0,74.3,26.4,5.58,18.9
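
The summary above reports a minimum of -7777 and a mean of about -385 for `snow-totals`, which cannot be real snowfall totals. A value like -7777 is most likely a special flag code rather than a measurement. That is an assumption here, and the NOAA documentation linked earlier is the place to check what it means. As a minimal sketch of how such a code could be turned into a missing value before summarizing, using a small made-up column:

```python
import numpy as np
import pandas as pd

# Made-up snowfall values; -7777.0 stands in for a flag code, not a measurement
snow = pd.DataFrame({"snow-totals": [0.0, 1.5, -7777.0, 9.7]})

# Replace the assumed flag value with NaN so summary statistics skip it
cleaned = snow.replace(-7777.0, np.nan)

print(snow["snow-totals"].mean())     # distorted by the flag value
print(cleaned["snow-totals"].mean())  # based on the three real values only
```

Whether the flagged entries should become missing values, zeros, or something else depends on what the code means, so this is a choice to make after reading the data documentation.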


We can also summarize subgroups of our data separately with the method `.groupby()`.
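
As a minimal sketch of the idea, the example below groups a small made-up data frame by a categorical column and then takes the mean of a numerical column within each group. The station labels `A` and `B` and the temperatures are invented for illustration.

```python
import pandas as pd

# Made-up observations from two stations
obs = pd.DataFrame({
    "name": ["A", "A", "B", "B"],
    "temp": [20.0, 30.0, 40.0, 60.0],
})

# Split the rows into one group per station name
by_station = obs.groupby("name")

# Summarize one column within each group
mean_temp = by_station["temp"].mean()
print(mean_temp)
```

Other summary methods, such as `.describe()`, can be used on the grouped column in the same way.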

### Question

Using the gapminder data, use `.groupby()` to get summary statistics by region.

In [52]:
## your code here: 

regions = df2.groupby('region')
regions['life_expectancy'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Africa,4293.0,48.173829,12.264941,4.0,38.17,49.1,57.0,77.6
America,2673.0,60.397871,15.190684,24.56,52.54,65.74,71.9,81.7
Asia,4212.0,55.761026,16.233842,8.0,43.1975,60.055,68.5,83.2
Europe,3562.0,65.867973,13.899266,19.76,64.54,70.595,74.5275,83.3


### Accessing rows and specific entries

You can also access a specific row using `df.loc[ROW, :]`, where `ROW` is the row's label in the index. The colon selects all columns for that row.

We can use `.loc` to find the value of a specific entry, as well, by giving both a row label and a column name.
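
A short sketch of both uses, on a made-up three-row data frame (the temperatures are invented). With the default index, the row labels are simply 0, 1, 2, and so on.

```python
import pandas as pd

# Made-up data frame with the default 0, 1, 2 row labels
stations = pd.DataFrame({
    "name": ["NORTON", "NATICK", "MILFORD"],
    "temp": [31.8, 30.5, 29.9],
})

# One whole row: the row labeled 1, all columns
second_row = stations.loc[1, :]
print(second_row)

# One specific entry: the row labeled 1, column "temp"
one_value = stations.loc[1, "temp"]
print(one_value)
```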

## for loops

### Math

If we multiply a column of a data frame by a single number, each value in the column is multiplied by that number.

We can turn this into a new column by assigning to `df['new_col_name']`.
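
A minimal sketch of both steps, using a made-up column of precipitation totals. The column names `precip_in` and `precip_mm` are illustrative, and the example assumes the totals are in inches, so multiplying by 25.4 converts them to millimeters.

```python
import pandas as pd

# Made-up precipitation totals, assumed to be in inches
rain = pd.DataFrame({"precip_in": [3.43, 3.25, 4.45]})

# Multiplying the column by one number multiplies every value in it;
# assigning the result to a new name adds it as a new column
rain["precip_mm"] = rain["precip_in"] * 25.4
print(rain)
```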

NumPy functions work very well with numerical columns: they are applied to every value in the column at once.

This new column is now reflected in the data frame. 
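
For instance, a NumPy function can be applied to a whole column and the result stored as a new column. The sketch below uses made-up values and illustrative column names (`area` and `side`).

```python
import numpy as np
import pandas as pd

# Made-up areas of square plots
areas = pd.DataFrame({"area": [1.0, 4.0, 9.0]})

# np.sqrt is applied to each value in the column
areas["side"] = np.sqrt(areas["area"])
print(areas)
```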

We can also do math between columns, since they have the same length. Elements of the same row are added, subtracted, multiplied, or divided.
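
As a sketch of column-by-column math, the example below combines a `temp` column and a `dtr` column row by row. It assumes, for illustration only, that `temp` is the midpoint between the daily maximum and minimum and that `dtr` is the difference between them. The new column names `approx_max` and `approx_min` are made up.

```python
import pandas as pd

# Two rows of temperature and diurnal temperature range values
temps = pd.DataFrame({"temp": [25.9, 28.9], "dtr": [19.7, 21.0]})

# Each operation pairs up the values that sit in the same row
temps["approx_max"] = temps["temp"] + temps["dtr"] / 2
temps["approx_min"] = temps["temp"] - temps["dtr"] / 2
print(temps)
```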


### Create your own data frame

To make your own data frame without a .csv, we use the function `pd.DataFrame()`. There are many ways to use this function to construct a data frame. 

Here, we show how to convert a dictionary of lists into a data frame. Each list will be its own column, and you need to make sure the lists are all the same length. The dictionary keys become the column names.

In [8]:
data_dict = {
    'a': [1, 3, 5],
    'b': ['apple', 'banana', 'apple'],
    'c': [-2., -3., -5.]
}

pd.DataFrame(data_dict)  # each key becomes a column name

You can also use lists of lists or 2D NumPy arrays to create data frames. Each inner list will be a row, instead of a column, and you will need to specify the column names with another argument in `pd.DataFrame()` called `columns`.

In [9]:
data_list = [
    [1, 'apple', -2.],
    [3, 'banana', -3.],
    [5, 'apple', -5.]
]

pd.DataFrame(data_list, columns=['a', 'b', 'c'])  # each inner list becomes a row


Note: we need to save this as a variable to use it in the future.

### Export data frame as .csv

If you have made modifications to a data set in Python and want to export that to a new .csv, you can easily do that with the `.to_csv()` method that all pandas data frames have.

In [55]:
my_df = pd.DataFrame(data_list, columns=['a', 'b', 'c'])
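
A minimal sketch of the export step is shown below. It rebuilds the same small data frame so that it runs on its own, writes it to a temporary folder (the file name `my_df.csv` is just an example), and reads it back to check the result. Passing `index=False` leaves the row labels out of the file.

```python
import os
import tempfile
import pandas as pd

my_df = pd.DataFrame(
    [[1, 'apple', -2.], [3, 'banana', -3.], [5, 'apple', -5.]],
    columns=['a', 'b', 'c'],
)

# Write to a temporary folder so the example does not clutter your project
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "my_df.csv")  # example file name
my_df.to_csv(out_path, index=False)            # index=False: do not write row labels

# Read the file back in to confirm what was saved
round_trip = pd.read_csv(out_path)
print(round_trip)
```

In your own work you would replace `out_path` with the path where you want the file to be saved.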


#### Question: Putting it together

In assignment 2, we moved information gathered from some researchers into a nested data structure. Instead, transfer these data into a Pandas dataframe. Display the data frame, and export it as a .csv file.

As a reminder, each list is in the same order as the researchers' names, so all of Haley McCann's data is at index `0`.

In [None]:
researchers = ['Haley McCann', 'Siena Welch', 'Jaylin Mercado', 'Ismael Hayden', 'Nina Bright']

temperatures = [29.75, 12.63, 31.58, 7.16, 32.51]

populations = [442, 336, 505, 913, 933]

dates = ['5/25/2022','3/18/2022','6/28/2022','11/11/2022','7/6/2023']

### Your code here:


## Resources

- [Pandas docs](https://pandas.pydata.org/docs/)
- [Pandas getting started](https://pandas.pydata.org/docs/getting_started/index.html#getting-started)
- [Pandas cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [PySpark for big data](https://spark.apache.org/docs/latest/api/python/)

This lesson is adapted from 
[Software Carpentry](http://swcarpentry.github.io/python-novice-gapminder/design/).