# Pandas

Let's practice with Pandas dataframes and series!

In [2]:
#Run these imports first
import numpy as np
import pandas as pd

print("numpy version:", np.__version__)
print("pandas version:", pd.__version__)

numpy version: 1.21.5
pandas version: 1.5.2


## Series

Let's make a series that includes one NaN (a missing value).


In [3]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [5]:
# Accessing a member of the series by its index label.
# Remember: the default index starts at zero.

s[4]

6.0
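
Since the series contains a missing value, it helps to know how to find and drop it. The following is a minimal sketch using the same values as above; the names `missing`, `n_missing` and `clean` are only illustrative. It also shows that `s[4]` on a default integer index looks up the *label* 4, which stops matching the position once rows are dropped.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Boolean mask: True where the value is missing
missing = s.isna()

# Count the missing values
n_missing = int(missing.sum())

# Drop the missing values; the remaining rows keep their original labels
clean = s.dropna()

print(n_missing)
print(list(clean.index))

# Label-based vs. position-based access now point at different rows
print(clean[4])       # row whose label is 4
print(clean.iloc[4])  # fifth row by position
```

After `dropna()`, label 3 is gone, so the label 4 and the position 4 refer to different rows. Using `.iloc` makes a positional lookup explicit.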

## Dataframes

Here we define a dataframe.

![](../assets/images/pandas-series-3.png)

In [10]:
df = pd.DataFrame({'Month' : ['Jan', 'Feb', 'Mar', 'Apr'],
                    'Sales': [10, 20, 30, 40]})
df

|   | Month | Sales |
|---|-------|-------|
| 0 | Jan   | 10    |
| 1 | Feb   | 20    |
| 2 | Mar   | 30    |
| 3 | Apr   | 40    |


### `df.info`

In [12]:
# df.info() prints a summary: index range, column names, non-null counts, dtypes and memory usage

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Month   4 non-null      object
 1   Sales   4 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 192.0+ bytes


### `df.dtypes`

Find the data type of each column.

In [21]:
df.dtypes

Month    object
Sales     int64
dtype: object

### Update the df

In [14]:
# Set sales for March to 300
## TODO: what are the row and column positions for 'Mar'?
df.iloc[2,1] = 300
df

|   | Month | Sales |
|---|-------|-------|
| 0 | Jan   | 10    |
| 1 | Feb   | 20    |
| 2 | Mar   | 300   |
| 3 | Apr   | 40    |
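
`iloc` requires you to know the numeric position of the cell. As a complementary sketch (not part of the original exercise), the same update can be written with `loc` and a boolean condition, so the row is found by its `Month` value rather than by counting. The variable `sales_pos` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar', 'Apr'],
                   'Sales': [10, 20, 30, 40]})

# Select the row(s) where Month is 'Mar' and assign to the 'Sales' column
df.loc[df['Month'] == 'Mar', 'Sales'] = 300

# If you still want iloc, look up the column's integer position by name
sales_pos = df.columns.get_loc('Sales')
df.iloc[0, sales_pos] = 15  # first row, 'Sales' column

print(df)
```

The `loc` form keeps working if rows are reordered, whereas hard-coded `iloc` positions do not.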


In [15]:
## TODO: Try setting some other Sales numbers
## Remember, the positions for both rows and columns start at 0
## Try iloc[0,0] - which cell does it modify?

# your code here

## Load data from CSV file

We will use the `pd.read_csv` function.  **This is the most common way to create dataframes.**

**TODO: Try to read local data and remote data, by adjusting `data_location` variable**

In [16]:
data_location = '../data/rainfall.csv'
# data_location =  'https://storage.googleapis.com/elephantscale-public/data/rainfall/rainfall.csv'
# data_location = 'https://raw.githubusercontent.com/elephantscale/datasets/master/rainfall/rainfall.csv'

rain = pd.read_csv(data_location)
rain

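|   | City          | Month | Rainfall |
|---|---------------|-------|----------|
| 0 | San Francisco | Jan   | 10.0     |
| 1 | Seattle       | Jan   | 30.0     |
| 2 | Los Angeles   | Jan   | 2.0      |
| 3 | Seattle       | Feb   | 20.0     |
| 4 | San Francisco | Feb   | 4.0      |
| 5 | Los Angeles   | Feb   | 0.0      |
| 6 | Seattle       | Mar   | 22.0     |
| 7 | San Francisco | Mar   | 4.0      |
| 8 | Los Angeles   | Mar   | NaN      |
| 9 | Seattle       | Apr   | NaN      |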
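
To see how `pd.read_csv` treats empty fields without depending on a file, here is a small self-contained sketch. The CSV text below is a made-up sample that mirrors the column names above; it is not the actual contents of `rainfall.csv`, and `csv_text` and `sample` are illustrative names.

```python
import io
import pandas as pd

# Hypothetical sample data; note the empty Rainfall field on the last row
csv_text = """City,Month,Rainfall
San Francisco,Jan,10
Seattle,Jan,30
Los Angeles,Mar,
"""

# read_csv accepts any file-like object, so StringIO stands in for a file
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.dtypes)
print(sample['Rainfall'].isna().sum())  # number of missing rainfall values
```

The empty field is read as NaN, which is why the `Rainfall` column is stored as floating point even though the other values look like integers.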


## Naming Rows

We can give rows names instead of numbers.

In [None]:
# Build row labels such as 'San FranciscoJan' by concatenating City and Month
rain2 = rain.set_index(rain['City'] + rain['Month'])
rain2

### Setting the index

Every Pandas dataframe has an index, which holds the row labels.  Here's how we use the index to get rows:

```python
rain.loc[0]    # gets the row labeled 0 as a series
rain.loc[[0]]  # gets the row labeled 0 as a one-row dataframe
```

Just as columns can be accessed by either position or name, rows can be accessed by either position (with `.iloc`) or label (with `.loc`).

By default, the index is simply the row number starting from zero, but this can be changed or overridden.

```python
rain.set_index("colname")
```
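
As a rough illustration of label versus position access once the index has been changed, consider the sketch below. `rain_demo` is a small stand-in dataframe invented for this example, not the rainfall file itself.

```python
import pandas as pd

# Small stand-in for the rainfall data (hypothetical values)
rain_demo = pd.DataFrame({'City': ['Seattle', 'San Francisco', 'Los Angeles'],
                          'Rainfall': [30.0, 10.0, 2.0]})

# Use the City column as the index; set_index returns a new dataframe
by_city = rain_demo.set_index('City')

row_by_label = by_city.loc['Seattle']       # row by label, as a series
row_by_pos = by_city.iloc[0]                # same row by position
frame_by_label = by_city.loc[['Seattle']]   # one-row dataframe

print(row_by_label)
print(frame_by_label)
```

When a column name is passed to `set_index`, that column moves into the index and is no longer a regular column of the result.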

In [None]:
rain.set_index(rain['City']  + rain['Month'])

### Pandas and changes in-place

**Most** (but not all) changes to a dataframe do not happen in-place.  Instead, they return a modified copy of the data and leave the original untouched.

Let's try looking up a row by its new label after calling `set_index` as above.

In [None]:
# Try looking up a row by its new label.
# Note: this won't work, because `rain` itself still has its default numeric index.
rain.loc['San FranciscoJan']  # ERROR! (raises KeyError)

`set_index`, like many other functions, returns a modified dataframe and does NOT change the original in-place. To apply the change, we can assign the result back to the original variable.

In [None]:
rain = rain.set_index(rain['City'] + rain['Month'])

rain

In [None]:
rain.loc['San FranciscoJan']  # Should work now

Many functions can also change the data in place if you pass the optional `inplace=True` parameter.

```python
rain.set_index(rain['City'] + rain['Month'], inplace=True)
```
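
The two approaches can be compared side by side. This is a minimal sketch on a made-up two-row dataframe (`demo` and the other variable names are illustrative). It also shows that a call with `inplace=True` returns `None`, so its result should not be assigned back.

```python
import pandas as pd

# Hypothetical stand-in for the rainfall data
demo = pd.DataFrame({'City': ['Seattle', 'Seattle'],
                     'Month': ['Jan', 'Feb'],
                     'Rainfall': [30.0, 20.0]})

# 1. Default behaviour: a new dataframe is returned, `demo` is untouched
indexed = demo.set_index(demo['City'] + demo['Month'])
index_before = demo.index.tolist()      # still the default 0, 1
index_of_copy = indexed.index.tolist()  # the new string labels

# 2. inplace=True: `demo` itself is modified and the call returns None
result = demo.set_index(demo['City'] + demo['Month'], inplace=True)
index_after = demo.index.tolist()

print(index_before, index_of_copy, index_after, result)
```

Writing `demo = demo.set_index(..., inplace=True)` would therefore replace the dataframe with `None`. Use either reassignment or `inplace=True`, not both.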
