# 02-Pandas basics

This notebook introduces `pandas`, a widely used Python package for data handling and manipulation.

https://pandas.pydata.org/docs/getting_started/index.html

Before we can use `pandas`, we must first import it into our program/Jupyter notebook. By convention, `pandas` is imported as `pd`.

In [1]:
import pandas as pd

## Series and DataFrame

`pandas` offers data types for tabular data.

<img src="images/table.png" width = "50%" align="left"/>

`pandas` has two data types: `Series` and `DataFrame`. 

A `Series` is a single column in a table, while a `DataFrame` is a collection of columns.

We create a `Series` by passing a list of values to the `Series` function.

In [2]:
name_lst = ['Ole', 'Jenny', 'Chang', 'Jonas']

name_lst

['Ole', 'Jenny', 'Chang', 'Jonas']

In [3]:
series = pd.Series(name_lst)

In [4]:
series

0      Ole
1    Jenny
2    Chang
3    Jonas
dtype: object

A `Series` has an `index` attribute.

In [5]:
series.index

RangeIndex(start=0, stop=4, step=1)
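
The default index simply numbers the values 0, 1, 2 and so on. As a small sketch, we can also supply our own labels through the optional `index` argument and then look up values by label. The variable names below are chosen for illustration only.

```python
import pandas as pd

# Illustrative Series: scores labelled by name instead of by position
scores = pd.Series([65.0, 58.0, 79.0], index=['Ole', 'Jenny', 'Chang'])

# The index now holds our labels instead of 0, 1, 2
labels = list(scores.index)

# Look up a value by its label
jenny_score = scores['Jenny']

print(labels)
print(jenny_score)
```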

However, we usually work with two-dimensional data, i.e. several variables for each observation. We can store two-dimensional data in a `pandas` `DataFrame`.

First, we create a dictionary with the keys as the column names and the values as the data.

In [6]:
grade_dict = {'Name'  : ['Ole', 'Jenny', 'Chang', 'Jonas'],
              'Score' : [65.0, 58.0, 79.0, 95.0],
              'Pass'  : ['yes', 'no', 'yes', 'yes']}

grade_dict

{'Name': ['Ole', 'Jenny', 'Chang', 'Jonas'],
 'Score': [65.0, 58.0, 79.0, 95.0],
 'Pass': ['yes', 'no', 'yes', 'yes']}

Second, we create a `DataFrame` by giving the dictionary with column names and values to the `DataFrame` function.

In [7]:
df = pd.DataFrame(grade_dict)

In [8]:
df

Unnamed: 0,Name,Score,Pass
0,Ole,65.0,yes
1,Jenny,58.0,no
2,Chang,79.0,yes
3,Jonas,95.0,yes


<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Take the dictionary <code>temps_dict</code> that you created in the mandatory exercise on day 1, and convert it to a <code>DataFrame</code> called <code>temps_df</code>.
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
temps_dict = {
    'oslo' : [0, -4, -3, 0, 3, 5, 4],
    'bergen' : [4, 3, 4, 3, 3, 7, 8],
    'trondheim' : [0, -1, -3, -2, -2, -5, -6]
}

temps_df = pd.DataFrame(temps_dict)
```

</p>
</details> 

In [9]:
temp_dict = {"Oslo": [0,-4,-3,0,3,5,4],
            "Bergen": [4,3,4,3,3,7,8],
            "Trondheim": [0,-1,-3,-2,-2,-5,-6]}

temps_df = pd.DataFrame(temp_dict)
temps_df

Unnamed: 0,Oslo,Bergen,Trondheim
0,0,4,0
1,-4,3,-1
2,-3,4,-3
3,0,3,-2
4,3,3,-2
5,5,7,-5
6,4,8,-6


A `DataFrame` has both an `index` and a `columns` attribute.

In [10]:
df.index

RangeIndex(start=0, stop=4, step=1)

In [11]:
df.columns


Index(['Name', 'Score', 'Pass'], dtype='object')

In general, we select rows and columns from a `DataFrame` by using the index operator `[]`.

To select a column, we place the column name inside `[]`.

In [12]:
df['Name']

0      Ole
1    Jenny
2    Chang
3    Jonas
Name: Name, dtype: object

To select multiple columns, we place a *list* of column names inside `[]`.

In [13]:
df[['Name', 'Score']]

Unnamed: 0,Name,Score
0,Ole,65.0
1,Jenny,58.0
2,Chang,79.0
3,Jonas,95.0


<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Select the columns with the daily temperature observations for Oslo and Bergen from <code>temps_df</code>.
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
temps_df[['oslo', 'bergen']]
```

</p>
</details> 

In [14]:
temps_df[["Oslo","Bergen"]]

Unnamed: 0,Oslo,Bergen
0,0,4
1,-4,3
2,-3,4
3,0,3
4,3,3
5,5,7
6,4,8


To select rows, we combine the index operator `[]` with the `loc` attribute.

In [15]:
df.loc[3]

Name     Jonas
Score     95.0
Pass       yes
Name: 3, dtype: object

We can select the value in a specific row and column by specifying both the row label and column label inside the square brackets.

In [16]:
df.loc[0, 'Name']

'Ole'

We can also *slice* the rows in the same way as we did with strings and lists. Notice that when slicing rows, it is no longer necessary to use the `loc` attribute.

In [17]:
df[:2]

Unnamed: 0,Name,Score,Pass
0,Ole,65.0,yes
1,Jenny,58.0,no
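
One detail is worth knowing here. Slicing with plain `[]` works by position and excludes the end point, just like lists, whereas slicing with `loc` works by label and includes the end label. A minimal sketch on a small stand-in `DataFrame` (the name `demo` is illustrative):

```python
import pandas as pd

demo = pd.DataFrame({'Name': ['Ole', 'Jenny', 'Chang', 'Jonas'],
                     'Score': [65.0, 58.0, 79.0, 95.0]})

by_position = demo[:2]      # rows at positions 0 and 1
by_label = demo.loc[:2]     # rows with labels 0, 1 and 2

print(len(by_position))
print(len(by_label))
```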


<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Show two different ways of selecting the last row in <code>temps_df</code>.
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
# extract last row using loc command
temps_df.loc[6]
 
# extract last row by slicing
temps_df[6:]
```

</p>
</details> 

In [18]:
temps_df.loc[6]

temps_df[6:]

Unnamed: 0,Oslo,Bergen,Trondheim
6,4,8,-6


Instead of simply displaying the subset of rows/columns, we can save the subset as a new `DataFrame` by assigning it a variable name.

However, notice that a subset of a `DataFrame` is not guaranteed to be an independent `DataFrame`. Depending on the operation and the `pandas` version, the subset may share its data with the original `DataFrame`, so modifying the subset can trigger a warning or have unintended effects.

In [19]:
df[:2]

Unnamed: 0,Name,Score,Pass
0,Ole,65.0,yes
1,Jenny,58.0,no


In order to make sure that we get a new, independent `DataFrame`, we call the `copy` function on the subset.

In [20]:
df_subset = df[:2].copy()

df_subset

Unnamed: 0,Name,Score,Pass
0,Ole,65.0,yes
1,Jenny,58.0,no
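
A short sketch of why the explicit copy is useful, using a small stand-in `DataFrame` (the names `original` and `subset` are illustrative): changes made to the copy leave the original untouched.

```python
import pandas as pd

original = pd.DataFrame({'Name': ['Ole', 'Jenny', 'Chang'],
                         'Score': [65.0, 58.0, 79.0]})

# Take the first two rows as an independent copy
subset = original[:2].copy()

# Modify the copy only
subset.loc[0, 'Score'] = 100.0

print(original.loc[0, 'Score'])   # value in the original
print(subset.loc[0, 'Score'])     # value in the copy
```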


<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Store the last row in <code>temps_df</code> in a new variable called <code>sunday_df</code>.
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
sunday_df = temps_df[6:].copy()
```

</p>
</details> 

In [21]:
sunday_df = temps_df[6:].copy()

sunday_df

Unnamed: 0,Oslo,Bergen,Trondheim
6,4,8,-6


## Import and save files

The file `titanic.csv` contains information on 891 of the passengers on the Titanic.

The file consists of the following data columns:

* PassengerId: Id of every passenger.
* Survived: This column has value 0 and 1 (0 for not survived and 1 for survived).
* Pclass: There are 3 classes (Class 1, Class 2 and Class 3).
* Name: Name of passenger.
* Sex: Gender of passenger.
* Age: Age of passenger.
* Fare: Ticket price paid by passenger.

We can import the file by supplying the file name to the `read_csv` function. 

As a default, `read_csv` will look for the file in the same folder as the notebook. 

However, if the file is in a subfolder, we must also specify the path to the file (i.e. the name of the subfolder).

In [22]:
titanic = pd.read_csv('data/titanic.csv')

In [23]:
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.2500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,7.9250
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1000
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.0500
...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,13.0000
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,30.0000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,23.4500
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,30.0000


Notice that `read_csv` assumes that the values in the file are separated by a comma `,`. We can change this by giving a new value to the optional parameter `sep`.

A pipe-delimited version of the file can be read by setting `sep = '|'`.

In [24]:
titanic_pipe = pd.read_csv('data/titanic_pipe.csv', sep = '|')

titanic_pipe

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.2500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,7.9250
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1000
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.0500
...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,13.0000
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,30.0000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,23.4500
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,30.0000
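
To see the effect of `sep` without needing a file on disk, here is a minimal sketch that reads a small pipe-delimited text from memory. The sample text borrows two rows from the table above and is not the real `titanic_pipe.csv`; `io.StringIO` simply makes a string behave like a file.

```python
import io
import pandas as pd

# Small illustrative sample in pipe-delimited format
raw_text = (
    "PassengerId|Survived|Name\n"
    "1|0|Braund, Mr. Owen Harris\n"
    "3|1|Heikkinen, Miss. Laina\n"
)

# sep='|' tells read_csv to split each line at the pipe characters
sample = pd.read_csv(io.StringIO(raw_text), sep='|')

print(sample.columns.tolist())
print(sample.shape)
```

Because the lines are split at `|`, the commas inside the names stay part of the `Name` values.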


`read_csv` has many optional parameters that we can pass arguments to in order to customize how we import the file.

For instance, we can give a list of column names to `usecols` in order to import only a subset of the columns.

In [25]:
titanic_subset = pd.read_csv('data/titanic.csv', usecols = ['PassengerId', 'Survived', 'Name'])

titanic_subset.head()

Unnamed: 0,PassengerId,Survived,Name
0,1,0,"Braund, Mr. Owen Harris"
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,3,1,"Heikkinen, Miss. Laina"
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,5,0,"Allen, Mr. William Henry"


See the [function documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for an overview of the parameters in `read_csv`.

We can save the data as a spreadsheet in the `data` folder using the `to_excel` function.

In [26]:
titanic.to_excel('data/titanic.xlsx')

`to_excel` has many optional parameters that we can change.

We can for instance specify the parameters `sheet_name` and `index`.

In [27]:
titanic.to_excel('data/titanic.xlsx', sheet_name = 'passengers', index = False)

See the [function documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html) for an overview of the parameters in `to_excel`.

Notice that if we want to save the data as a CSV file, we have to use the `to_csv` function instead. See the [function documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) for an overview of the parameters in `to_csv`.


<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Save <code>temps_df</code> as <code>temperatures.xlsx</code> using <code>to_excel</code> and as <code>temperatures.csv</code> using <code>to_csv</code>.
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
temps_df.to_excel('data/temperatures.xlsx', index = False)
temps_df.to_csv('data/temperatures.csv', index = False)
```

</p>
</details> 

In [28]:
temps_df.to_excel("data/temperatures.xlsx")
temps_df.to_csv("data/temperatures.csv")

We can then import the Excel file using the `read_excel` function.

In [29]:
# (this will only work as long as we have created the excel file above)
titanic = pd.read_excel('data/titanic.xlsx')

titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.2500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,7.9250
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1000
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.0500
...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,13.0000
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,30.0000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,23.4500
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,30.0000


See the [function documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) for an overview of the parameters in `read_excel`.

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Use <code>read_excel</code> to import <code>temperatures.xlsx</code>.     
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
temps_df = pd.read_excel('data/temperatures.xlsx')
 
temps_df
```

</p>
</details> 

In [30]:
temps_df = pd.read_excel('data/temperatures.xlsx')
temps_df

Unnamed: 0.1,Unnamed: 0,Oslo,Bergen,Trondheim
0,0,0,4,0
1,1,-4,3,-1
2,2,-3,4,-3
3,3,0,3,-2
4,4,3,3,-2
5,5,5,7,-5
6,6,4,8,-6
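
The extra `Unnamed: 0` column in the output above arises because the file was saved with the default `index = True`, so the row index was written to the file as a column of its own and then read back as data. A minimal sketch of the difference, using `to_csv`/`read_csv` and a temporary folder so that nothing in `data` is overwritten (the file and variable names are illustrative):

```python
import os
import tempfile
import pandas as pd

temps = pd.DataFrame({'Oslo': [0, -4, -3], 'Bergen': [4, 3, 4]})

with tempfile.TemporaryDirectory() as folder:
    with_index_path = os.path.join(folder, 'with_index.csv')
    without_index_path = os.path.join(folder, 'without_index.csv')

    temps.to_csv(with_index_path)                   # index is written as a first, unnamed column
    temps.to_csv(without_index_path, index=False)   # index is left out of the file

    with_index = pd.read_csv(with_index_path)
    without_index = pd.read_csv(without_index_path)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```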


## Exploring the data

After importing a file, it is important to explore the data, both to get a sense of its contents and to make sure that the file was imported correctly.

`head` and `tail` show the first five and the last five rows by default.

In [31]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.25
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,7.925
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.05


In [32]:
titanic.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,13.0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,30.0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,23.45
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,30.0
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,7.75
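
Both functions also accept an optional number of rows. A minimal sketch on a small illustrative `DataFrame`:

```python
import pandas as pd

numbers = pd.DataFrame({'value': range(10)})   # values 0 to 9

first_three = numbers.head(3)   # first three rows
last_two = numbers.tail(2)      # last two rows

print(first_three['value'].tolist())
print(last_two['value'].tolist())
```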


`info` displays the data types of the columns (notice that `object` usually indicates strings).

In [33]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   Fare         891 non-null    float64
dtypes: float64(2), int64(3), object(2)
memory usage: 48.9+ KB


`describe` displays descriptive statistics for the *numeric* columns.

In [34]:
titanic.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,Fare
count,891.0,891.0,891.0,714.0,891.0
mean,446.0,0.383838,2.308642,29.699118,32.204208
std,257.353842,0.486592,0.836071,14.526497,49.693429
min,1.0,0.0,1.0,0.42,0.0
25%,223.5,0.0,2.0,20.125,7.9104
50%,446.0,0.0,3.0,28.0,14.4542
75%,668.5,1.0,3.0,38.0,31.0
max,891.0,1.0,3.0,80.0,512.3292


<div class="alert alert-info">
<h3> Your turn</h3>
    <p> In addition to <code>describe</code>, there are several functions that we can apply on columns to calculate statistics such as the average value, standard deviation, maximum value etc. See if you can find a <code>pandas</code> function to calculate the median age of the passengers in <code>titanic</code>.
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
titanic['Age'].median()
```

</p>
</details> 

In [35]:
titanic["Age"].median()

28.0
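
The same pattern works for other summary statistics. A small sketch using the `Score` values from earlier in this notebook (the variable name `scores` is illustrative):

```python
import pandas as pd

scores = pd.Series([65.0, 58.0, 79.0, 95.0])

mean_score = scores.mean()       # average value
median_score = scores.median()   # middle value
max_score = scores.max()         # largest value
min_score = scores.min()         # smallest value

print(mean_score, median_score, max_score, min_score)
```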

`nunique` and `unique` show the *number of unique values* and the *unique values* in a specific column.

In [36]:
titanic['Survived'].nunique()

2

In [37]:
titanic['Survived'].unique()

array([0, 1], dtype=int64)

`value_counts` counts the number of observations for each unique value in a column.

In [38]:
titanic['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64
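
If shares are more useful than counts, `value_counts` accepts the optional argument `normalize=True`. A minimal sketch using the `Pass` values from earlier in this notebook (the variable names are illustrative):

```python
import pandas as pd

answers = pd.Series(['yes', 'no', 'yes', 'yes'])

counts = answers.value_counts()                  # number of observations per value
shares = answers.value_counts(normalize=True)    # fraction of observations per value

print(counts['yes'])
print(shares['yes'])
```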

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> How many passengers in <code>titanic</code> were in first class?
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
first_class = titanic['Pclass'].value_counts().loc[1]

print('There were ' + str(first_class) + ' passengers in first class.')
```

</p>
</details> 

In [39]:
first_class = titanic["Pclass"].value_counts().loc[1]

print("There were " + str(first_class) + " passengers in first class")

There were 216 passengers in first class


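`corr` calculates the correlation coefficient between all pairs of numeric columns in a `DataFrame`. Note that in `pandas` 2.0 and later, `corr` raises an error if the `DataFrame` contains non-numeric columns such as `Name`, unless we pass `numeric_only = True`.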

In [40]:
titanic.corr()

Unnamed: 0,PassengerId,Survived,Pclass,Age,Fare
PassengerId,1.0,-0.005007,-0.035144,0.036847,0.012658
Survived,-0.005007,1.0,-0.338481,-0.077221,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.369226,-0.5495
Age,0.036847,-0.077221,-0.369226,1.0,0.096067
Fare,0.012658,0.257307,-0.5495,0.096067,1.0
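
When only one pair of columns is of interest, we can also call `corr` on one column and pass the other column as the argument. A minimal sketch with made-up numbers, where the second column is exactly twice the first so the correlation is 1:

```python
import pandas as pd

# Made-up data for illustration
data = pd.DataFrame({'hours': [1, 2, 3, 4],
                     'score': [2, 4, 6, 8]})

# Correlation between two specific columns
pair_corr = data['hours'].corr(data['score'])

print(round(pair_corr, 6))
```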


#### Missing data

Missing data in `pandas` is denoted by the value `NaN`, which stands for 'not a number'.

In [41]:
titanic.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,13.0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,30.0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,23.45
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,30.0
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,7.75


We can count the total number of missing values in each column in a `DataFrame` by combining the `isna` and `sum` functions.

`isna` creates boolean values (`True`/`False`) for each cell in a `DataFrame`, indicating whether or not the cell has a missing value.

In [42]:
titanic.isna()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False
887,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False
889,False,False,False,False,False,False,False


We can then use `sum` to count the number of `True` values in each column.

In [43]:
titanic.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
Fare             0
dtype: int64
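
This works because `True` counts as 1 and `False` as 0 when summed. A small sketch with a made-up `DataFrame`, where `None` marks the missing values:

```python
import pandas as pd

# Made-up data: two of the three ages are missing
people = pd.DataFrame({'Name': ['Ole', 'Jenny', 'Chang'],
                       'Age': [22.0, None, None]})

# Count the missing values in each column
missing_per_column = people.isna().sum()

print(missing_per_column['Name'])
print(missing_per_column['Age'])
```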

## Converting `dtype`

When we import files, `pandas` infers the data types of the columns in the file.

However, sometimes we want to change the `dtype` of a column, either because `pandas` got it wrong or because another data type is more appropriate for our analysis.

The file `AAPL.csv` contains data on stock prices and trading volume for Apple on every weekday in 2020.

In [44]:
apple = pd.read_csv('data/AAPL.csv')

apple.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2020-01-02,74.059998,75.150002,73.797501,75.087502,74.333511,135480400
1,2020-01-03,74.287498,75.144997,74.125,74.357498,73.61084,146322800
2,2020-01-06,73.447502,74.989998,73.1875,74.949997,74.197395,118387200
3,2020-01-07,74.959999,75.224998,74.370003,74.597504,73.848442,108872000
4,2020-01-08,74.290001,76.110001,74.290001,75.797501,75.036385,132079200


In [45]:
len(apple)

252

In [46]:
apple.tail()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
247,2020-12-23,132.160004,132.429993,130.779999,130.960007,130.764603,88223700
248,2020-12-24,131.320007,133.460007,131.100006,131.970001,131.773087,54930100
249,2020-12-28,133.990005,137.339996,133.509995,136.690002,136.486053,124486200
250,2020-12-29,138.050003,138.789993,134.339996,134.869995,134.668762,121047300
251,2020-12-30,135.580002,135.990005,133.399994,133.720001,133.520477,96452100


`read_csv` imported the price data as floats, the trading volumes as integers, and the dates as strings.

In [47]:
apple.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       252 non-null    object 
 1   Open       252 non-null    float64
 2   High       252 non-null    float64
 3   Low        252 non-null    float64
 4   Close      252 non-null    float64
 5   Adj Close  252 non-null    float64
 6   Volume     252 non-null    int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 13.9+ KB


We can apply `astype` on a column in order to change the `dtype` of a column to `str`, `float` or `int`.

In order to modify the `DataFrame`, we must set the old column equal to the updated column using the `=` operator.

In [48]:
apple['Volume'] = apple['Volume'].astype('str')

apple.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2020-01-02,74.059998,75.150002,73.797501,75.087502,74.333511,135480400
1,2020-01-03,74.287498,75.144997,74.125,74.357498,73.61084,146322800
2,2020-01-06,73.447502,74.989998,73.1875,74.949997,74.197395,118387200
3,2020-01-07,74.959999,75.224998,74.370003,74.597504,73.848442,108872000
4,2020-01-08,74.290001,76.110001,74.290001,75.797501,75.036385,132079200


In [49]:
apple.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       252 non-null    object 
 1   Open       252 non-null    float64
 2   High       252 non-null    float64
 3   Low        252 non-null    float64
 4   Close      252 non-null    float64
 5   Adj Close  252 non-null    float64
 6   Volume     252 non-null    object 
dtypes: float64(5), object(2)
memory usage: 13.9+ KB


In [50]:
apple['Volume'] = apple['Volume'].astype(float)

apple.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2020-01-02,74.059998,75.150002,73.797501,75.087502,74.333511,135480400.0
1,2020-01-03,74.287498,75.144997,74.125,74.357498,73.61084,146322800.0
2,2020-01-06,73.447502,74.989998,73.1875,74.949997,74.197395,118387200.0
3,2020-01-07,74.959999,75.224998,74.370003,74.597504,73.848442,108872000.0
4,2020-01-08,74.290001,76.110001,74.290001,75.797501,75.036385,132079200.0


In [51]:
apple['Volume'] = apple['Volume'].astype(int)

apple.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2020-01-02,74.059998,75.150002,73.797501,75.087502,74.333511,135480400
1,2020-01-03,74.287498,75.144997,74.125,74.357498,73.61084,146322800
2,2020-01-06,73.447502,74.989998,73.1875,74.949997,74.197395,118387200
3,2020-01-07,74.959999,75.224998,74.370003,74.597504,73.848442,108872000
4,2020-01-08,74.290001,76.110001,74.290001,75.797501,75.036385,132079200


In [52]:
apple.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       252 non-null    object 
 1   Open       252 non-null    float64
 2   High       252 non-null    float64
 3   Low        252 non-null    float64
 4   Close      252 non-null    float64
 5   Adj Close  252 non-null    float64
 6   Volume     252 non-null    int32  
dtypes: float64(5), int32(1), object(1)
memory usage: 12.9+ KB


<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Notice that the values in the <code>Age</code> column in <code>titanic</code> are floats and not integers. However, we normally think of ages as whole numbers and not numbers with decimals. Try and convert the <code>Age</code> column to integers. Why does this not work?
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
# try to convert to integers...
titanic['Age'] = titanic['Age'].astype(int)
    
# but it will throw a ValueError since the Age column contains NaN
# these are non-numbers and we cannot convert a non-number to an integer

```

</p>
</details> 

In [53]:
titanic["Age"] = titanic["Age"].astype(int)

# This will not work: the Age column contains NaN, which cannot be converted to an integer.
titanic.info()

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
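
The failure can be reproduced on a small made-up column. The sketch below catches the error (it is a kind of `ValueError`) and then shows one possible workaround, converting only the non-missing values. Whether dropping missing values is acceptable depends on the analysis, so treat this as an illustration rather than a recommendation.

```python
import pandas as pd

ages = pd.Series([22.0, None, 35.0])   # None becomes NaN

try:
    ages.astype(int)
    conversion_failed = False
except ValueError:
    # NaN has no integer representation, so pandas refuses the conversion
    conversion_failed = True

# One option: drop the missing values first, then convert
ages_int = ages.dropna().astype(int)

print(conversion_failed)
print(ages_int.tolist())
```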

What data type is appropriate for the `Date` column in `apple`: `str`, `float` or `int`?

In [None]:
apple.loc[0, 'Date']

Although we can interpret the `Date` column as a string, `pandas` was actually developed in order to handle time series data (especially financial time series). `pandas` therefore comes with an additional data type known as `datetime`.

`to_datetime` will convert a series of dates to `datetime`.

In [None]:
pd.to_datetime(apple['Date'])

In [None]:
apple['Date'] = pd.to_datetime(apple['Date'])

In [None]:
apple.info()

The `Date` column is now `datetime`, meaning that each value in the column is interpreted as a *timestamp*.

In [None]:
apple.loc[0, 'Date']
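
Since the cells above are shown without output, here is a minimal self-contained sketch using the first two dates from `AAPL.csv`. Once a column is `datetime`, the `.dt` accessor gives access to parts of each date, such as the year or the name of the weekday.

```python
import pandas as pd

date_strings = pd.Series(['2020-01-02', '2020-01-03'])

# Convert the strings to datetime values
dates = pd.to_datetime(date_strings)

years = dates.dt.year.tolist()            # year of each date
weekdays = dates.dt.day_name().tolist()   # weekday name of each date

print(years)
print(weekdays)
```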

## Mandatory exercise, part 1

The file <code>mpg.xlsx</code> (in the data folder) contains observations on fuel economy and 6 additional attributes for 398 different car models. The column <code>mpg</code> is a measure of the car's fuel economy, i.e. the number of miles per gallon of petrol.
        
Import the file as a <code>DataFrame</code> and answer the following questions:

1. Which columns in the dataframe are strings?


2. What is the average number of miles per gallon of the car models in the data?


3. What are the unique numbers of cylinders observed in the data?


4. How many of the car models in the data were from Europe?


5. What is the correlation between cars' fuel economy and horsepower?


6. Are there any missing observations in the data?

In [None]:
mpg = pd.read_excel("data/mpg.xlsx")

mpg.info()                             #1
print(mpg['mpg'].mean())               #2
print(mpg['cylinders'].nunique())      #3
print(mpg['origin'].value_counts())    #4
print(mpg.corr())                      #5
print(mpg.isna().sum())                #6




In [None]:
#1. Column 7 and 8 contain strings.
#2. The average number of miles per gallon is 23.5145
#3. There are 5 different cylinders.
#4. 70 of the car models were from Europe.
#5. The correlation between mpg and horsepower is -0.775396.
#6. There are 6 missing observations in the data for the cars' horsepower.