# <center>Class 4</center>

## Pandas

In [None]:
import pandas as pd
import numpy as np
import os

In [None]:
import warnings
warnings.filterwarnings('ignore')

`Pandas dataframes` are your tool for working with tabular data. Think of a dataframe as a programmable version of an Excel sheet. Excel is good enough that it is hard to motivate users to try something new, but despite its flexibility and user-friendly interface it does have limitations, which Pandas can address.

Pandas started out as the Python version of R dataframes. Now it has its own ecosystem.

### Pandas Series

A Pandas dataframe is a columnar structure where each column is essentially a `Pandas Series`. Series are strictly 1-dimensional and can contain any data type (integers, strings, floats, objects, etc.), including a mix of them. A Series can be created from a scalar, a list, an ndarray or a dictionary using `pd.Series()` (note the capital **“S”**).

By default, series are labelled with indices starting from 0. For example:

In [None]:
s = pd.Series(data = [-5, 1.3, 21, 6, 3])

In [None]:
s

But you can add a custom index:

In [None]:
pd.Series(data = [-5, 1.3, 21, 6, 3],
          index = ['a', 'b', 'c', 'd', 'e'])

Create a series from a dictionary. The keys become the index labels, and the values become the series values.

In [None]:
pd.Series(data = {'a': 10, 'b': 20, 'c': 30})
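The other two input types mentioned earlier, a scalar and an ndarray, can be sketched in the same way. The variable names below are illustrative only.

```python
import numpy as np
import pandas as pd

# From a scalar: the value is repeated once for every index label
s_scalar = pd.Series(data=3.14, index=['a', 'b', 'c'])

# From a NumPy array: the series gets the default 0-based index
# and keeps the array's data type
s_array = pd.Series(data=np.array([10, 20, 30]))

print(s_scalar)
print(s_array)
```

Note that a scalar needs an explicit `index`, otherwise Pandas cannot know how many times to repeat the value and creates a series of length one.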

A Pandas series can hold heterogeneous data, i.e. values of different types.

In [None]:
pd.Series([1, 'a', 2.0, np.random.rand(1)])

In [None]:
np.random.rand(1)

Question: How can I modify the code above so that the last value of the series is the random number itself instead of a NumPy array?

In [None]:
pd.Series([1, 'a', 2.0, np.random.rand(1)[0]])
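Mixing types has a cost that is worth seeing once: a series with mixed values falls back to the generic `object` data type, while a purely numeric series gets a numeric one. A minimal sketch with made-up values:

```python
import pandas as pd

mixed = pd.Series([1, 'a', 2.0])   # integers, strings and floats together
numeric = pd.Series([1, 2.0, 3])   # only numbers

print(mixed.dtype)    # generic Python objects
print(numeric.dtype)  # integers are upcast to floats
```

Numeric operations are typically much faster on numeric data types, so it usually pays to keep one type per column.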

### Pandas DataFrames

 `DataFrames` are really just Series stuck together! Think of a DataFrame as a dictionary of series, with the _“keys”_ being the column labels and the _“values”_ being the series data.
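The "dictionary of series" picture can be taken literally. The sketch below builds a small dataframe from two series (hypothetical example data) and pulls one column back out as a series.

```python
import pandas as pd

names = pd.Series(['Tom', 'Mike'], index=['R1', 'R2'])
courses = pd.Series([5, 4], index=['R1', 'R2'])

# Keys become column labels, the series become the columns
df_demo = pd.DataFrame({'Name': names, 'Courses': courses})

# Selecting a single column gives back a series
col = df_demo['Courses']
print(type(col))
print(df_demo)
```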

#### Creating DataFrames

Dataframes can be created using `pd.DataFrame()` (note the capital “D” and “F”). Like series, index and column labels of dataframes are labelled starting from 0 by default. Note that we are using a _list of lists_ as the data input.

In [None]:
pd.DataFrame([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

Add `index` and `columns` labels.

In [None]:
pd.DataFrame([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]],
             index = ["R1", "R2", "R3"],
             columns = ["C1", "C2", "C3"])

Use a _dictionary_ as input.

In [None]:

pd.DataFrame(
    {
        "C1": [1, 2, 3],
        "C2": ['A', 'B', 'C']
    },
    index=["R1", "R2", "R3"]
)

#### Slicing and indexing

There are several main ways to select data from a DataFrame:

- []
- .loc[]
- .iloc[]
- Boolean indexing
- .query()

In [None]:
df = pd.DataFrame(
    {
        "Name": ["Tom", "Mike", "Tiffany"],
        "Language": ["Python", "Python", "R"],
        "Courses": [5, 4, 7]
    })
df

**Using `[]`**

In [None]:
df['Name']  # returns a series

In [None]:
df[['Name']]  # returns a dataframe!

In [None]:
df[['Name', 'Language']]

In [None]:
df[0:1] 

In [None]:
df[:1] 

In [None]:
df[1:] 

**Indexing with `.loc` and `.iloc`**

Pandas provides the indexers `.loc[]` and `.iloc[]` as more flexible alternatives for accessing data in a dataframe. Use `df.iloc[]` for _indexing by integer position_ and `df.loc[]` for _indexing by label_.

In [None]:
df.iloc[0]  # returns a series

In [None]:
df.iloc[0:2]  # slicing returns a dataframe

In [None]:
df.iloc[2, 1]  # returns the indexed object

In [None]:
df

In [None]:
df.iloc[[0, 1], [1, 2]]  # returns a dataframe

In [None]:
df.loc[:, 'Name'] # series

In [None]:
df.loc[:, ['Name']] # dataframe

In [None]:
df.loc[:, ['Name','Language', 'Language']]

In [None]:
df.loc[[0, 2], ['Language', 'Courses']]
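With the default 0-based index, positions and labels coincide, so `.loc[]` and `.iloc[]` can look interchangeable. The difference shows up as soon as the index is something else. A minimal sketch, using a hypothetical dataframe with the index labels 10, 20 and 30:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Name": ["Tom", "Mike", "Tiffany"], "Courses": [5, 4, 7]},
    index=[10, 20, 30],
)

by_position = df_demo.iloc[0]   # first row, whatever its label is
by_label = df_demo.loc[10]      # the row whose label is 10

# Slices differ too: .iloc excludes the end position,
# .loc includes the end label
pos_slice = df_demo.iloc[0:2]      # positions 0 and 1
label_slice = df_demo.loc[10:20]   # labels 10 and 20

# df_demo.loc[0] would raise a KeyError: there is no row labelled 0
```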

Get the `index` and `columns` labels.

In [None]:
df.index

You can also redefine the values of the index by overwriting the dataframe's `index` attribute.

In [None]:
df.index = [3,4,5]

In [None]:
df

In [None]:
df.index

In [None]:
type(df.index)

In [None]:
df.columns

In [None]:
type(df.columns)

**Boolean indexing**

In [None]:
df[df['Courses'] > 5]

In [None]:
df[(df['Name'] == "Tom") & (df.Language == 'Python')]

In [None]:
df['Name'] == 'Tom'
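A comparison such as the one above produces a series of `True`/`False` values, usually called a mask. Masks can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A short sketch on a copy of the small example table (named `df_demo` here to keep it separate from `df`):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "Name": ["Tom", "Mike", "Tiffany"],
        "Language": ["Python", "Python", "R"],
        "Courses": [5, 4, 7],
    }
)

# Rows where the language is R OR more than 4 courses were taken
mask = (df_demo["Language"] == "R") | (df_demo["Courses"] > 4)
selected = df_demo[mask]

# Negation: everyone who does not use Python
not_python = df_demo[~(df_demo["Language"] == "Python")]

print(mask)
print(selected)
print(not_python)
```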

**Indexing with `.query()`**

In [None]:
df.query("Courses > 4 & Language == 'Python'") # note the mixed use of double and single quotes. Why is that? 

Query also allows you to reference variables in the current workspace using the `@` symbol:

In [None]:
course_threshold = 4
df.query("Courses > @course_threshold")
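A query string is another way of writing a Boolean mask. The sketch below runs the same filter both ways on a copy of the example table and checks that the results agree. The names `df_demo` and `min_courses` are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "Name": ["Tom", "Mike", "Tiffany"],
        "Language": ["Python", "Python", "R"],
        "Courses": [5, 4, 7],
    }
)
min_courses = 4

via_query = df_demo.query("Courses > @min_courses and Language == 'Python'")
via_mask = df_demo[
    (df_demo["Courses"] > min_courses) & (df_demo["Language"] == "Python")
]

print(via_query.equals(via_mask))
```

Which form to use is mostly a matter of readability: `.query()` tends to be shorter for long conditions, while masks work with any column name and any Python expression.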

### Reading and inspecting data

In [None]:
df = pd.read_csv(os.path.join(os.pardir, 'data', 'titanic.csv'))

In [None]:
df.head()

In [None]:
df.tail()

The `.T` attribute (also available as the `transpose()` method) helps you inspect rows with many columns.

In [None]:
# closer look at a few rows
df.iloc[0:2].T

In [None]:
df.info()

- Pclass: passenger class
- SibSp: number of siblings & spouses aboard
- Parch: number of parents & children aboard
- Fare: ticket price
- Cabin: room id, if any
- Embarked: port of departure (Cherbourg, Queenstown, Southampton)

In [None]:
df.describe()

In [None]:
df.describe().T

#### Some simple metrics

There are two ways to access individual columns:
```python
df['Survived'] # refer to it by column name in brackets
df.Survived # refer it as an attribute of the dataframe
```
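The two forms are not fully interchangeable. Attribute access only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute or method. A small sketch with hypothetical column names chosen to show both problems:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Survived": [1, 0], "Home town": ["A", "B"], "count": [3, 4]}
)

same = df_demo["Survived"].equals(df_demo.Survived)  # both forms agree here

# A name with a space can only be reached with brackets
home = df_demo["Home town"]

# 'count' is also a DataFrame method, so the attribute form
# returns the method, not the column
attr = df_demo.count
column = df_demo["count"]

print(same, callable(attr), column.tolist())
```

Brackets always work, which makes them the safer default in reusable code.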

In [None]:
df.Survived.value_counts()

In [None]:
df.Class.value_counts()

In databases, 'Class' is a *dimension* and 'Survived' is a *metric*. You can aggregate *metrics* by *dimensions*. In databases these aggregations are done with a `GROUP BY` clause. `groupby` is also available in Pandas.

In [None]:
df.groupby(by = 'Class')['Survived'].mean()

In [None]:
df.groupby(by = 'Sex')['Survived'].mean()

In [None]:
df.groupby(by = ['Class', 'Sex'])['Survived'].mean()

<br> How can we interpret these `mean` values?
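One way to read them: when a column is coded as 0 and 1, its mean is the share of ones, so the group means are survival rates. The toy table below (made-up numbers, not the Titanic data) makes the arithmetic easy to follow.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Class": [1, 1, 3, 3, 3, 3],
        "Survived": [1, 1, 1, 0, 0, 0],
    }
)

# Mean of a 0/1 column = number of ones / number of rows
rates = toy.groupby(by="Class")["Survived"].mean()

# The same information as explicit counts
counts = toy.groupby(by="Class")["Survived"].agg(["sum", "count"])

print(rates)
print(counts)
```

In this toy table both first-class passengers survived (rate 1.0), while one of four third-class passengers did (rate 0.25).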

In [None]:
df.isna().sum()

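Let's assume that a missing 'Survived' value means the person did not survive, so we can replace missing values with zero (did not survive). `fillna()` accepts an `inplace` argument, but calling it on a single column (`df.Survived.fillna(0, inplace = True)`) acts on an intermediate object and is not guaranteed to update `df` in recent Pandas versions. Pandas DataFrames are *mutable*, so we overwrite the column with the filled values instead.

In [None]:
df['Survived'] = df['Survived'].fillna(0)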

In [None]:
df.isna().sum()
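Filling missing values is not a neutral step: `mean()` skips missing values by default, so replacing them with zeros changes every average computed afterwards. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Survived": [1.0, 0.0, np.nan, np.nan]})

# Missing values are ignored: (1 + 0) / 2
mean_before = toy["Survived"].mean()

# Assign the filled column back to the dataframe
toy["Survived"] = toy["Survived"].fillna(0)

# Now all four rows count: (1 + 0 + 0 + 0) / 4
mean_after = toy["Survived"].mean()

print(mean_before, mean_after)
```

This is why the assumption behind the fill value matters and should be stated explicitly.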

In [None]:
df.groupby(by = ['Class', 'Sex'])['Survived'].mean()

In [None]:
# string names select Pandas' built-in aggregations ('var' is the sample variance)
df.groupby(["Pclass"]).agg({"Fare": ["mean", "var"], "Age": "mean"})

You can use `map()` to format the numbers in the dataframe (`DataFrame.map()` requires Pandas 2.1 or later; in earlier versions it is called `applymap()`).

In [None]:
df.groupby(["Pclass"]).agg({"Fare": ["mean", "var"], "Age": "mean"}).map('{:,.1f}'.format)

#### Filtering

In [None]:
df[df.Embarked == 'C'] # embarked is Cherbourg

In [None]:
df[(df.Embarked == 'C') & (df.Pclass == 3)]

### Modifying dataframes

#### Creating new columns

In [None]:
df['Is_adult'] = df.Age >= 18  # 18-year-olds count as adults; missing ages evaluate to False

In [None]:
df.iloc[0].T

In [None]:
df.Is_adult.sum()

In [None]:
def is_english(value):
    if 'England' in value:
        return 1
    else:
        return 0

In [None]:
df['Is_English'] = df.Hometown.apply(is_english) # This will throw an error. Why?

In [None]:
df.Hometown.isna().sum()

In [None]:
df.Hometown.isna() # it looks like a mask!

In [None]:
df[df.Hometown.isna()]

In [None]:
df.loc[557, 'Hometown']

In [None]:
type(df.loc[557, 'Hometown'])

In [None]:
def is_english(value):
    if 'England' in str(value):  # missing values (NaN) are floats, so we cast every value to a string before searching
        return 1
    else:
        return 0

In [None]:
df['Is_English'] = df.Hometown.apply(is_english)

In [None]:
df['Is_English'].sum()
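As an alternative to `apply()` with a custom function, Pandas string methods can do the same check on the whole column at once. This is a different technique from the one used above, sketched here on a few made-up hometown values. The `na=False` argument says that missing values should count as "no match".

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, including one missing entry
hometowns = pd.Series(["London, England", "Paris, France", np.nan])

# regex=False: treat 'England' as plain text, not as a pattern
is_english = hometowns.str.contains("England", regex=False, na=False).astype(int)

print(is_english.tolist())
```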

#### Sorting and resetting indices

In [None]:
# this operation will be executed 'inplace'
df.sort_values(by = 'Age', inplace = True)
df.head()

In [None]:
# instead of inplace we can also recreate the dataframe + we can define ascending or descending sorting order

df = df.sort_values(by = ['Fare', 'Age'], ascending = [False,  True])
df.head()

Note: Pandas dataframes are *mutable*, so many operations can be done *inplace*. *Spark* dataframes, by contrast, are immutable, so every modification creates a new dataframe.

In [None]:
df.sort_values(by = 'Age', inplace = True)  # modifies df in place
df = df.sort_values(by = 'Age')             # equivalent: creates a sorted copy and reassigns it

In [None]:
df.reset_index(drop = True, inplace = True)
df.head()
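The difference between the two styles, and the reason for resetting the index, is easier to see on a three-row toy table (made-up ages):

```python
import pandas as pd

toy = pd.DataFrame({"Age": [30, 10, 20]})

# Without inplace: a sorted copy is returned, toy itself is untouched
sorted_copy = toy.sort_values(by="Age")
original_order = toy["Age"].tolist()

# With inplace: toy is modified and the call returns None
result = toy.sort_values(by="Age", inplace=True)

# Sorting moves the rows but keeps their old index labels
labels_after_sort = toy.index.tolist()

# reset_index(drop=True) relabels the rows 0, 1, 2, ...
# and discards the old labels
toy = toy.reset_index(drop=True)

print(original_order, result, labels_after_sort, toy.index.tolist())
```

A common mistake is writing `df = df.sort_values(..., inplace=True)`, which replaces the dataframe with `None`.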

In [None]:
df.tail()

### Cleaning & modifying tables using Pandas

We are cleaning our favourite Vienna hotels table. 

In [None]:
df_raw = pd.read_csv(os.path.join(os.pardir, 'data', 'hotelbookingdata-vienna.csv'))

In [None]:
df_raw.info()

In [None]:
df_raw.head().T

Getting distance from city center as a float.

In [None]:
df_raw.center1distance.iloc[0]

In [None]:
df_raw.center1distance.iloc[0].split()

In [None]:
df_raw.center1distance.iloc[0].split()[0]

In [None]:
float(df_raw.center1distance.iloc[0].split()[0])

Putting it all together with `apply()` and a lambda function:

In [None]:
df_raw["distance"] = df_raw["center1distance"].apply(lambda x: float(x.split()[0]))

In [None]:
df_raw.distance
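The same split-and-convert steps can also be written with the `.str` accessor, which applies string methods to a whole column without a lambda. This is an alternative sketch, not the approach used above. The sample values are hypothetical and only assume that each entry starts with a number followed by a space.

```python
import pandas as pd

# Hypothetical values in the style of the center1distance column
raw = pd.Series(["2.7 miles", "0.5 miles", "1.1 miles"])

distance = (
    raw.str.split()   # split each string on whitespace
    .str[0]           # keep the first piece of every row
    .astype(float)    # convert the text to a number
)

print(distance.tolist())
```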

In [None]:
df_raw.distance.plot(kind = 'hist', bins = 26, rwidth = 0.9);

In [None]:
df_raw.accommodationtype.iloc[0]

In [None]:
df_raw.accommodationtype.iloc[0].split('@')

In [None]:
df_raw.accommodationtype.iloc[0].split('@')[1]

In [None]:
df_raw['accommodation_type'] = df_raw.accommodationtype.apply(lambda x: x.split('@')[1])

In [None]:
df_raw.accommodation_type.value_counts()

In [None]:
df_raw.guestreviewsrating.isna().sum()

In [None]:
df_raw.guestreviewsrating.iloc[0]

In [None]:
df_raw.guestreviewsrating.iloc[0].split()

In [None]:
df_raw.guestreviewsrating.iloc[0].split()[0]

In [None]:
isinstance(df_raw.guestreviewsrating.iloc[0].split()[0], str)

In [None]:
float(df_raw.guestreviewsrating.iloc[0].split()[0])

In [None]:
df_raw.guestreviewsrating.value_counts()

In [None]:
df_raw['rating'] = df_raw.guestreviewsrating.apply(lambda x: float(x.split()[0]) if isinstance(x, str) else None)

In [None]:
# Check results
df_raw.rating.value_counts()

In [None]:
df_raw.rating.isna().sum()
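For a column with missing entries, the `.str` accessor has the convenient property that it passes missing values through untouched, so no `isinstance` check is needed. A possible alternative sketch, again on hypothetical sample values whose format is an assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical values in the style of guestreviewsrating, one of them missing
raw = pd.Series(["4.3 /5", "3.9 /5", np.nan])

# Missing values stay missing through split() and str[0];
# to_numeric then converts the text and leaves NaN as NaN
rating = pd.to_numeric(raw.str.split().str[0], errors="coerce")

print(rating)
```

With `errors="coerce"`, any entry that cannot be parsed as a number also becomes missing instead of raising an error, so it is worth comparing the count of missing values before and after.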

Rename columns

In [None]:
df_raw.rename(
    columns = {
        "rating2_ta": "ratingta",
        "rating2_ta_reviewcount": "ratingta_count",
        "addresscountryname": "country",
        "s_city": "city",
        "starrating": "stars",
    }, 
    inplace = True
)

In [None]:
df_raw.head()

In [None]:
df_raw[['hotel_id', 'neighbourhood',  'distance', 'stars', 'rating', 'ratingta_count', 'accommodation_type',
       'price', 'city']]   