# 02-Pandas basics

This notebook gives an introduction to `pandas`, a widely used Python package for data handling and manipulation.

https://pandas.pydata.org/docs/getting_started/index.html

Before we can use `pandas`, we must first import it into our program or Jupyter notebook. By convention, `pandas` is imported as `pd`.

In [None]:
import pandas as pd

## Series and DataFrame

`pandas` offers data types for tabular data.

<img src="images/table.png" width = "50%" align="left"/>

`pandas` has two main data types: `Series` and `DataFrame`.

A `Series` is a single column in a table, while a `DataFrame` is a collection of columns.

We create a `Series` by passing a list of values to the `Series` function.

In [None]:
name_lst = ['Ole', 'Jenny', 'Chang', 'Jonas']

name_lst

In [None]:
series = pd.Series(name_lst)

In [None]:
series

A `Series` has an `index` attribute.

In [None]:
series.index
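As a minimal sketch of what the index does, the example below compares the default index with a custom one. The student IDs (`s1` to `s4`) are made up for illustration and are not part of the course data.

```python
import pandas as pd

# Default index: pandas labels the values 0, 1, 2, ...
names = pd.Series(['Ole', 'Jenny', 'Chang', 'Jonas'])
default_labels = list(names.index)

# Hypothetical custom index (the student IDs are invented for this example)
names_by_id = pd.Series(['Ole', 'Jenny', 'Chang', 'Jonas'],
                        index=['s1', 's2', 's3', 's4'])

# Values are looked up by their index label
second_student = names_by_id['s2']

print(default_labels)
print(second_student)
```

The index labels, not the positions, are what we use to look up values in a `Series`.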

However, we usually work with two-dimensional data, i.e. several variables for each observation. We can store two-dimensional data in a `pandas` `DataFrame`.

First, we create a dictionary with the keys as the column names and the values as the data.

In [None]:
grade_dict = {'Name'  : ['Ole', 'Jenny', 'Chang', 'Jonas'],
              'Score' : [65.0, 58.0, 79.0, 95.0],
              'Pass'  : ['yes', 'no', 'yes', 'yes']}

grade_dict

Second, we create a `DataFrame` by giving the dictionary with column names and values to the `DataFrame` function.

In [None]:
df = pd.DataFrame(grade_dict)

In [None]:
df

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Take the dictionary <code>temps_dict</code> that you created in the mandatory exercise on day 1, and convert it to a <code>DataFrame</code> called <code>temps_df</code>.
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
temps_dict = {
    'oslo' : [0, -4, -3, 0, 3, 5, 4],
    'bergen' : [4, 3, 4, 3, 3, 7, 8],
    'trondheim' : [0, -1, -3, -2, -2, -5, -6]
}

temps_df = pd.DataFrame(temps_dict)
```

</p>
</details> 

A `DataFrame` has both an `index` and a `columns` attribute.

In [None]:
df.index

In [None]:
df.columns

In general, we select rows and columns from a `DataFrame` by using the index operator `[]`.

To select a column, we place the column name inside `[]`.

In [None]:
df['Name']

To select multiple columns, we place a *list* of column names inside `[]`.

In [None]:
df[['Name', 'Score']]
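The two kinds of selection return different data types: a single column name gives a `Series`, while a list of names gives a `DataFrame`, even if the list holds only one name. A small check, rebuilding the grade data from above:

```python
import pandas as pd

df = pd.DataFrame({'Name':  ['Ole', 'Jenny', 'Chang', 'Jonas'],
                   'Score': [65.0, 58.0, 79.0, 95.0],
                   'Pass':  ['yes', 'no', 'yes', 'yes']})

# A single column name returns a Series
one_column = df['Name']

# A list of column names returns a DataFrame, even with one name in the list
one_column_frame = df[['Name']]

print(type(one_column))
print(type(one_column_frame))
```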

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Select the columns with the daily temperature observations for Oslo and Bergen from <code>temps_df</code>.
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
temps_df[['oslo', 'bergen']]
```

</p>
</details> 

To select rows, we combine the index operator `[]` with the `loc` attribute.

In [None]:
df.loc[3]

We can select the value in a specific row and column by specifying both the row label and column label inside the square brackets.

In [None]:
df.loc[0, 'Name']

We can also *slice* the rows the same way as we did with strings and lists. Notice that when slicing rows by position, it is no longer necessary to use the `loc` attribute.

In [None]:
df[:2]
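One detail is worth checking for ourselves: slicing with `[]` works by position and excludes the end point (like lists), whereas slicing with `loc` works by label and includes the end label. A short sketch using the grade data:

```python
import pandas as pd

df = pd.DataFrame({'Name':  ['Ole', 'Jenny', 'Chang', 'Jonas'],
                   'Score': [65.0, 58.0, 79.0, 95.0]})

# Position-based slice: rows at positions 0 and 1 (end point excluded)
by_position = df[:2]

# Label-based slice: rows with labels 0, 1 and 2 (end label included)
by_label = df.loc[:2]

print(len(by_position))
print(len(by_label))
```

With the default index the labels and positions coincide, so the only visible difference is whether the last row of the slice is included.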

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Show two different ways of selecting the last row in <code>temps_df</code>.
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
# extract last row using loc command
temps_df.loc[6]
 
# extract last row by slicing
temps_df[6:]
```

</p>
</details> 

Instead of simply displaying the subset of rows/columns, we can save the subset as a new `DataFrame` by assigning it a variable name.

However, notice that a subset of a `DataFrame` is not necessarily an independent `DataFrame`. Depending on the operation and the `pandas` version, the subset can share its data with the original `DataFrame`, so modifying the subset may trigger a `SettingWithCopyWarning` or have unintended effects.

In [None]:
df[:2]

To be sure that we get a new, independent `DataFrame`, we call the `copy` method on the subset.

In [None]:
df_subset = df[:2].copy()

df_subset
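To see that the copy is independent, we can change a value in the copy and confirm that the original is untouched. This is only a small sketch with the grade data; the new score of 100 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'Name':  ['Ole', 'Jenny', 'Chang', 'Jonas'],
                   'Score': [65.0, 58.0, 79.0, 95.0]})

# Create an independent copy of the first two rows
df_subset = df[:2].copy()

# Change a value in the copy only
df_subset.loc[0, 'Score'] = 100.0

original_score = df.loc[0, 'Score']
copied_score = df_subset.loc[0, 'Score']

print(original_score)
print(copied_score)
```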

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Store the last row in <code>temps_df</code> in a new variable called <code>sunday_df</code>.
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
sunday_df = temps_df[6:].copy()
```

</p>
</details> 

## Import and save files

The file `titanic.csv` contains information on 891 of the passengers on the Titanic.

The file consists of the following data columns:

* PassengerId: Id of every passenger.
* Survived: This column has value 0 and 1 (0 for not survived and 1 for survived).
* Pclass: There are 3 classes (Class 1, Class 2 and Class 3).
* Name: Name of passenger.
* Sex: Gender of passenger.
* Age: Age of passenger.
* Fare: Ticket price paid by passenger.

We can import the file by supplying the file name to the `read_csv` function. 

By default, `read_csv` will look for the file in the same folder as the notebook.

However, if the file is in a subfolder, we must also specify the path to the file (i.e. the name of the subfolder).

In [None]:
titanic = pd.read_csv('data/titanic.csv')

In [None]:
titanic

Notice that `read_csv` assumes that the values in the file are separated by a comma `,`. We can change this by giving a new value to the optional parameter `sep`.

A pipe-delimited version of the file can be read by setting `sep = '|'`.

In [None]:
titanic_pipe = pd.read_csv('data/titanic_pipe.csv', sep = '|')

titanic_pipe
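To see what `sep` changes without needing a file on disk, the sketch below reads a small made-up pipe-delimited text through `io.StringIO`, which lets `read_csv` treat a string as if it were a file.

```python
import io
import pandas as pd

# Made-up pipe-delimited data, only for illustration
pipe_text = "Name|Score\nOle|65.0\nJenny|58.0\n"

# With the default separator (comma), each line ends up in a single column
wrong = pd.read_csv(io.StringIO(pipe_text))

# With sep='|', the values are split into two columns
scores = pd.read_csv(io.StringIO(pipe_text), sep='|')

print(wrong.shape)
print(scores.shape)
```

A `DataFrame` with a single, oddly named column is often a sign that the separator was wrong.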

`read_csv` has many optional parameters that we can pass arguments to in order to customize how we import the file.

For instance, we can give a list of column names to `usecols` in order to import only a subset of the columns.

In [None]:
titanic_subset = pd.read_csv('data/titanic.csv', usecols = ['PassengerId', 'Survived', 'Name'])

titanic_subset.head()
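The same idea can be tried on a tiny example. The passenger rows below are invented placeholders, not rows from `titanic.csv`.

```python
import io
import pandas as pd

# Invented rows with the same kind of columns as the Titanic file
csv_text = (
    "PassengerId,Survived,Name,Age\n"
    "1,0,Passenger A,30\n"
    "2,1,Passenger B,41\n"
)

# Import only two of the four columns
subset = pd.read_csv(io.StringIO(csv_text), usecols=['PassengerId', 'Name'])

print(list(subset.columns))
```

Importing only the columns we need can save memory when the file is large.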

See the [function documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for an overview of the parameters in `read_csv`.

We can save the data as a spreadsheet in the `data` folder using the `to_excel` function.

In [None]:
titanic.to_excel('data/titanic.xlsx')

`to_excel` has many optional parameters that we can change.

We can for instance specify the parameters `sheet_name` and `index`.

In [None]:
titanic.to_excel('data/titanic.xlsx', sheet_name = 'passengers', index = False)

See the [function documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html) for an overview of the parameters in `to_excel`.

Notice that if we want to save the data as a CSV file, we have to use the `to_csv` function instead. See the [function documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) for an overview of the parameters in `to_csv`.
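The effect of the `index` parameter is easiest to see in the raw text that `to_csv` produces. When we call `to_csv` without a file name, it returns the CSV content as a string instead of writing a file, which the sketch below uses to compare the two header lines.

```python
import pandas as pd

df = pd.DataFrame({'Name':  ['Ole', 'Jenny'],
                   'Score': [65.0, 58.0]})

# Default: the index is written as an extra, unnamed first column
with_index = df.to_csv()

# index=False: only the data columns are written
without_index = df.to_csv(index=False)

header_with_index = with_index.splitlines()[0]
header_without_index = without_index.splitlines()[0]

print(header_with_index)
print(header_without_index)
```

If we keep the index when saving, it typically comes back as an extra column (often named `Unnamed: 0`) when the file is imported again, which is why `index = False` is a common choice.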

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Save <code>temps_df</code> as <code>temperatures.xlsx</code> using <code>to_excel</code> and as <code>temperatures.csv</code> using <code>to_csv</code>.
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
temps_df.to_excel('data/temperatures.xlsx', index = False)
temps_df.to_csv('data/temperatures.csv', index = False)
```

</p>
</details> 

We can then import the Excel file using the `read_excel` function.

In [None]:
# (this will only work as long as we have created the excel file above)
titanic = pd.read_excel('data/titanic.xlsx')

titanic

See the [function documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) for an overview of the parameters in `read_excel`.

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Use <code>read_excel</code> to import <code>temperatures.xlsx</code>.     
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
temps_df = pd.read_excel('data/temperatures.xlsx')
 
temps_df
```

</p>
</details> 

## Exploring the data

After a file has been imported, it is important to explore the data, both to get a sense of its contents and to make sure that the file was imported correctly.

By default, `head` and `tail` show the first five and the last five rows.

In [None]:
titanic.head()

In [None]:
titanic.tail()

`info` displays the data types of the columns (notice that 'object' indicates a string).

In [None]:
titanic.info()

`describe` displays descriptive statistics for the *numeric* columns.

In [None]:
titanic.describe()
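A small sketch with the grade data from earlier shows that `describe` skips the string columns and returns the statistics as a `DataFrame`, so individual numbers can be picked out with `loc`.

```python
import pandas as pd

df = pd.DataFrame({'Name':  ['Ole', 'Jenny', 'Chang', 'Jonas'],
                   'Score': [65.0, 58.0, 79.0, 95.0],
                   'Pass':  ['yes', 'no', 'yes', 'yes']})

summary = df.describe()

# Only the numeric column is summarized
summary_columns = list(summary.columns)

# Individual statistics are selected by row label and column name
mean_score = summary.loc['mean', 'Score']
max_score = summary.loc['max', 'Score']

print(summary_columns)
print(mean_score)
```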

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> In addition to <code>describe</code>, there are several functions that we can apply on columns to calculate statistics such as the average value, standard deviation, maximum value etc. See if you can find a <code>pandas</code> function to calculate the median age of the passengers in <code>titanic</code>.
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
titanic['Age'].median()
```

</p>
</details> 

`nunique` and `unique` show the *number of unique values* and the *unique values* in a specific column.

In [None]:
titanic['Survived'].nunique()

In [None]:
titanic['Survived'].unique()

`value_counts` counts the number of observations for each unique value in a column.

In [None]:
titanic['Survived'].value_counts()
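The three functions can be compared side by side on a small column. The sketch below uses the `Pass` values from the grade data rather than the Titanic file.

```python
import pandas as pd

passed = pd.Series(['yes', 'no', 'yes', 'yes'])

# Number of distinct values
n_unique = passed.nunique()

# The distinct values, in order of first appearance
unique_values = list(passed.unique())

# Number of observations per distinct value
counts = passed.value_counts()

print(n_unique)
print(unique_values)
print(counts)
```

The result of `value_counts` is itself a `Series` whose index holds the unique values, so a single count can be selected by label, e.g. `counts['yes']`.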

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> How many passengers in <code>titanic</code> were in first class?
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
first_class = titanic['Pclass'].value_counts().loc[1]

print('There were ' + str(first_class) + ' passengers in first class.')
```

</p>
</details> 

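`corr` calculates the pairwise correlation coefficients between the columns in a `DataFrame`. Because `titanic` also contains string columns, we set `numeric_only=True` so that only the numeric columns are used (without it, recent versions of `pandas` raise an error).

In [None]:
titanic.corr(numeric_only=True)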
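As a minimal sketch of how to read the output, the example below builds a small `DataFrame` in which one numeric column is an exact multiple of the other, so their correlation is 1. The `Hours` column is hypothetical and only there for illustration. In `pandas` 2.0 and later, `corr` does not drop string columns on its own, so we pass `numeric_only=True`.

```python
import pandas as pd

df = pd.DataFrame({'Name':  ['Ole', 'Jenny', 'Chang', 'Jonas'],
                   'Hours': [1.0, 2.0, 3.0, 4.0],      # hypothetical study hours
                   'Score': [10.0, 20.0, 30.0, 40.0]})  # made-up scores

# Only the numeric columns are used; the Name column is ignored
corr_matrix = df.corr(numeric_only=True)

# The correlation between two specific columns is one cell of the matrix
hours_score_corr = corr_matrix.loc['Hours', 'Score']

print(corr_matrix)
```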

#### Missing data

Missing data in `pandas` is denoted by the value `NaN`, which stands for 'not a number'.

In [None]:
titanic.tail()

We can count the total number of missing values in each column in a `DataFrame` by combining the `isna` and `sum` functions.

`isna` creates boolean values (`True`/`False`) for each cell in a `DataFrame`, indicating whether or not the cell has a missing value.

In [None]:
titanic.isna()

We can then use `sum` to count the number of `True` in each column.

In [None]:
titanic.isna().sum()
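The two steps can be followed on a tiny example. The ages below are made up, and one of them is deliberately missing.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Ole', 'Jenny', 'Chang'],
                   'Age':  [22.0, None, 35.0]})  # None is stored as NaN

# Step 1: True where a value is missing, False otherwise
is_missing = df.isna()

# Step 2: summing counts the True values in each column
missing_per_column = is_missing.sum()

# Summing once more gives the total number of missing values
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```

The sum works because `True` counts as 1 and `False` as 0.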

## Converting `dtype`

When we import files, `pandas` infers the data types of the columns in the file.

However, sometimes we want to change the `dtype` of the columns, either because `pandas` got it wrong, or because there are other data types that are more appropriate for our analysis.

The file `AAPL.csv` contains data on stock prices and trading volume for Apple on every weekday in 2020.

In [None]:
apple = pd.read_csv('data/AAPL.csv')

apple.head()

In [None]:
len(apple)

In [None]:
apple.tail()

`read_csv` imported the price data as floats, the trading volumes as integers, and the dates as strings.

In [None]:
apple.info()

We can apply `astype` to a column in order to change its `dtype` to `str`, `float` or `int`.

In order to modify the `DataFrame`, we must set the old column equal to the updated column using the `=` operator.

In [None]:
apple['Volume'] = apple['Volume'].astype(str)

apple.head()

In [None]:
apple.info()

In [None]:
apple['Volume'] = apple['Volume'].astype(float)

apple.head()

In [None]:
apple['Volume'] = apple['Volume'].astype(int)

apple.head()

In [None]:
apple.info()
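Why the `dtype` matters becomes clearer when we operate on the values. In the sketch below (with made-up volumes), adding two strings glues the digits together, while adding the same values as integers gives the arithmetic sum.

```python
import pandas as pd

volume = pd.Series([100, 250, 75])  # made-up trading volumes

# As strings, '+' concatenates the text
as_text = volume.astype(str)
text_result = as_text[0] + as_text[1]

# As floats, the values get a decimal part
as_float = volume.astype(float)

# Converting the strings back to integers restores normal arithmetic
back_to_int = as_text.astype(int)
int_result = back_to_int[0] + back_to_int[1]

print(text_result)
print(int_result)
```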

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Notice that the values in the <code>Age</code> column in <code>titanic</code> are floats and not integers. However, we normally think of ages as whole numbers and not numbers with decimals. Try and convert the <code>Age</code> column to integers. Why does this not work?
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
# try to convert to integers...
titanic['Age'] = titanic['Age'].astype(int)
    
# but it will throw a ValueError since the Age column contains NaN
# these are non-numbers and we cannot convert a non-number to an integer

```

</p>
</details> 

However, what data type is appropriate for the `Date` column? `str`, `float` or `int`?

In [None]:
apple.loc[0, 'Date']

Although we can interpret the `Date` column as a string, `pandas` was actually developed in order to handle time series data (especially financial time series). `pandas` therefore comes with an additional data type known as `datetime`.

`to_datetime` will convert a series of dates to `datetime`.

In [None]:
pd.to_datetime(apple['Date'])

In [None]:
apple['Date'] = pd.to_datetime(apple['Date'])

In [None]:
apple.info()

The `Date` column is now `datetime`, meaning that each value in the column is interpreted as a *timestamp*.

In [None]:
apple.loc[0, 'Date']
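Once a column is `datetime`, we can ask questions that would be awkward with strings. The sketch below assumes dates written as `YYYY-MM-DD`; the actual format in `AAPL.csv` may differ, and the three dates are only examples.

```python
import pandas as pd

# Example dates as strings (assumed YYYY-MM-DD format)
dates = pd.Series(['2020-01-02', '2020-01-03', '2020-01-06'])

timestamps = pd.to_datetime(dates)

# Each value is now a timestamp with attributes such as year and weekday
first = timestamps[0]
first_year = first.year
first_weekday = first.day_name()

# Subtracting two timestamps gives the time between them
days_between = (timestamps[2] - timestamps[0]).days

print(first_year, first_weekday, days_between)
```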

## Mandatory exercise, part 1

The file <code>mpg.xlsx</code> (in the data folder) contains observations on fuel economy and 6 additional attributes for 398 different car models. The column <code>mpg</code> is a measure of the car's fuel economy, i.e. the number of miles per gallon of petrol.
        
Import the file as a <code>DataFrame</code> and answer the following questions:

1. Which columns in the dataframe are strings?


2. What is the average number of miles per gallon of the car models in the data?


3. What are the unique numbers of cylinders observed in the data?


4. How many of the car models in the data were from Europe?


5. What is the correlation between cars' fuel economy and horsepower?


6. Are there any missing observations in the data?