# Pandas 2 - Questions
© Advanced Analytics, Amir Ben Haim, 2024

## Cleaning Data

The data contains malformed strings, Python lists and missing values.
Tidy it up so you can get on with the analysis.

### Exercise 1

Import pandas under the name `pd`.

In [1]:
import pandas as pd

### Exercise 2

Consider the following Python dictionary `data`:

``` python
data = {
        'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 'Budapest_PaRis', 'Brussels_londOn'],
        'FlightNumber': [10045, None, None, None, 10085],
        'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
        'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', '12. Air France', '"Swiss Air"'],
        'Landed':['yes','no','yes','no','no']
        }
```
<br>
Create a DataFrame `df` from this dictionary `data`.
<br>

In [2]:
data = {
        'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 'Budapest_PaRis', 'Brussels_londOn'],
        'FlightNumber': [10045, None, None, None, 10085],
        'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
        'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', '12. Air France', '"Swiss Air"'],
        'Landed':['yes','no','yes','no','no']
        }

df = pd.DataFrame(data)
df

Unnamed: 0,From_To,FlightNumber,RecentDelays,Airline,Landed
0,LoNDon_paris,10045.0,"[23, 47]",KLM(!),yes
1,MAdrid_miLAN,,[],<Air France> (12),no
2,londON_StockhOlm,,"[24, 43, 87]",(British Airways. ),yes
3,Budapest_PaRis,,[13],12. Air France,no
4,Brussels_londOn,10085.0,"[67, 32]","""Swiss Air""",no


### Exercise 3

The From\_To column would be better as two separate columns!
<br>Split each string on the underscore delimiter `_` to give a <u>new temporary DataFrame</u> with the correct values.
<br>Assign the correct column names to this temporary DataFrame.

In [3]:
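# Exercise 3: split on "_" into a temporary DataFrame and name its columns
x = df.From_To.str.split("_", expand=True)
x.columns = ["From", "to"]
# Exercise 4: keep only the first letter of each city name uppercase
x.From = x.From.str.capitalize()
x.to = x.to.str.capitalize()
# Exercise 5: attach the temporary DataFrame and drop the original column
df = df.join(x).drop("From_To", axis=1)
df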

Unnamed: 0,FlightNumber,RecentDelays,Airline,Landed,From,to
0,10045.0,"[23, 47]",KLM(!),yes,London,Paris
1,,[],<Air France> (12),no,Madrid,Milan
2,,"[24, 43, 87]",(British Airways. ),yes,London,Stockholm
3,,[13],12. Air France,no,Budapest,Paris
4,10085.0,"[67, 32]","""Swiss Air""",no,Brussels,London


### Exercise 4

Notice how the capitalisation of the city names is all mixed up in this temporary DataFrame.
<br>Standardise the strings so that only the first letter is uppercase (e.g. "londON" should become "London".)

### Exercise 5

Delete the From_To column from `df` and attach the <u>temporary DataFrame</u> from the previous questions.

### Exercise 6

In the Airline column, you can see some extra punctuation and symbols have appeared around the airline names.
<br>Pull out just the airline name. E.g. `'(British Airways. )'` should become `'British Airways'`.

In [4]:
# Capture the first run of letters and spaces, then trim the surrounding whitespace
df.Airline = df.Airline.str.extract(r'([a-zA-Z\s]+)', expand=False)
df.Airline = df.Airline.str.strip()
df

Unnamed: 0,FlightNumber,RecentDelays,Airline,Landed,From,to
0,10045.0,"[23, 47]",KLM,yes,London,Paris
1,,[],Air France,no,Madrid,Milan
2,,"[24, 43, 87]",British Airways,yes,London,Stockholm
3,,[13],Air France,no,Budapest,Paris
4,10085.0,"[67, 32]",Swiss Air,no,Brussels,London


### Exercise 7

In the RecentDelays column, each value has been entered into the DataFrame as a list.
<br>We would like each first value in its own column, each second value in its own column, and so on.
<br>If there isn't an Nth value, the value should be NaN.
<br>Expand the Series of lists into a DataFrame named `delays`, rename the columns `delay_1`, `delay_2`, etc.
<br>And replace the unwanted RecentDelays column in `df` with `delays`.

In [5]:
a = df.RecentDelays.apply(pd.Series, dtype="object")
a.columns = [f"delay{v}" for v in range(1, len(a.columns) + 1)]
df = df.join(a)
df

Unnamed: 0,FlightNumber,RecentDelays,Airline,Landed,From,to,delay1,delay2,delay3
0,10045.0,"[23, 47]",KLM,yes,London,Paris,23.0,47.0,
1,,[],Air France,no,Madrid,Milan,,,
2,,"[24, 43, 87]",British Airways,yes,London,Stockholm,24.0,43.0,87.0
3,,[13],Air France,no,Budapest,Paris,13.0,,
4,10085.0,"[67, 32]",Swiss Air,no,Brussels,London,67.0,32.0,


In [6]:
# Keep every non-delay column as an identifier and melt the delay columns into long format
col = [c for c in df.columns if c not in ["delay1", "delay2", "delay3"]]
df1 = df.melt(id_vars=col, value_name='Delay', var_name='s')



In [8]:
# Drop the NaN rows that melt produced for missing delays
df1 = df1[df1.Delay.notna()]
# Flights with no recorded delays get a single "No Delay" row
df2 = df[df.delay1.isna() & df.delay2.isna() & df.delay3.isna()].copy()
df2.loc[:, "Delay"] = 0.0
df2.loc[:, "s"] = "No Delay"
df2 = df2[[c for c in df2.columns if c not in ["delay1", "delay2", "delay3"]]]
df = pd.concat([df1, df2], ignore_index=True)


In [15]:
df = df.drop('RecentDelays', axis=1)

In [21]:
df = df.rename(columns={'s': 'DelayNum'})
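
The cells above reshape the delays into a long format, which goes a step further than the exercise asks. As a minimal sketch of the wide layout the exercise describes (a `delays` frame with `delay_1`, `delay_2`, ... columns that replaces `RecentDelays`), the following uses a small standalone frame named `flights`. That name and its sample rows are assumptions for illustration, so the sketch does not depend on the earlier cells.

```python
import pandas as pd

# Hypothetical standalone frame with the same shape of data as the exercise
flights = pd.DataFrame({
    "Airline": ["KLM", "Air France", "British Airways"],
    "RecentDelays": [[23, 47], [], [24, 43, 87]],
})

# One row per list; shorter lists are padded with NaN
delays = pd.DataFrame(flights["RecentDelays"].tolist(), index=flights.index)
delays.columns = [f"delay_{i}" for i in range(1, delays.shape[1] + 1)]

# Replace the list column with the expanded columns
flights = flights.drop(columns="RecentDelays").join(delays)
```

Building the frame from `tolist()` is one option among several; `apply(pd.Series)` as used above gives the same layout but tends to be slower on larger data.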

### Exercise 8

The 'Landed' column contains the values 'yes' and 'no'.
<br>Replace this column with a column of boolean values: 'yes' should be `True` and 'no' should be `False`.
<br>
**HINT:**
`map()`

In [10]:
df.Landed = df.Landed.map({'yes': True, 'no': False})
df

Unnamed: 0,FlightNumber,RecentDelays,Airline,Landed,From,to,s,Delay
0,10045.0,"[23, 47]",KLM,True,London,Paris,delay1,23.0
1,,"[24, 43, 87]",British Airways,True,London,Stockholm,delay1,24.0
2,,[13],Air France,False,Budapest,Paris,delay1,13.0
3,10085.0,"[67, 32]",Swiss Air,False,Brussels,London,delay1,67.0
4,10045.0,"[23, 47]",KLM,True,London,Paris,delay2,47.0
5,,"[24, 43, 87]",British Airways,True,London,Stockholm,delay2,43.0
6,10085.0,"[67, 32]",Swiss Air,False,Brussels,London,delay2,32.0
7,,"[24, 43, 87]",British Airways,True,London,Stockholm,delay3,87.0
8,,[],Air France,False,Madrid,Milan,No Delay,0.0


### Exercise 9

In the 'Airline' column, change the 'Air France' entries to 'El-Al'.
<br>
**HINT:**
`replace()`

In [23]:
df.Airline = df.Airline.replace({'Air France': 'El-Al'})
df

Unnamed: 0,FlightNumber,Airline,Landed,From,to,DelayNum,Delay
0,10045.0,KLM,True,London,Paris,delay1,23.0
1,,British Airways,True,London,Stockholm,delay1,24.0
2,,El-Al,False,Budapest,Paris,delay1,13.0
3,10085.0,Swiss Air,False,Brussels,London,delay1,67.0
4,10045.0,KLM,True,London,Paris,delay2,47.0
5,,British Airways,True,London,Stockholm,delay2,43.0
6,10085.0,Swiss Air,False,Brussels,London,delay2,32.0
7,,British Airways,True,London,Stockholm,delay3,87.0
8,,El-Al,False,Madrid,Milan,No Delay,0.0


### Exercise 10

Append a new row 'k' to `df` with your choice of values for each column.

In [29]:
df.loc['k'] = {'FlightNumber': 10095.0, 'Airline': 'America Airlines', 'Landed': True, 'From': 'Seatle', 'to': 'Tel-Aviv', 'DelayNum': 'No Delay', 'Delay': 0}

### Exercise 11

Delete that row ('k') to return the original DataFrame.

In [30]:
df

Unnamed: 0,FlightNumber,Airline,Landed,From,to,DelayNum,Delay
0,10045.0,KLM,True,London,Paris,delay1,23.0
1,,British Airways,True,London,Stockholm,delay1,24.0
2,,El-Al,False,Budapest,Paris,delay1,13.0
3,10085.0,Swiss Air,False,Brussels,London,delay1,67.0
4,10045.0,KLM,True,London,Paris,delay2,47.0
5,,British Airways,True,London,Stockholm,delay2,43.0
6,10085.0,Swiss Air,False,Brussels,London,delay2,32.0
7,,British Airways,True,London,Stockholm,delay3,87.0
8,,El-Al,False,Madrid,Milan,No Delay,0.0
k,10095.0,America Airlines,True,Seatle,Tel-Aviv,No Delay,0.0


In [33]:
# errors='ignore' keeps this cell safe to re-run once 'k' has already been removed
df = df.drop('k', errors='ignore')

### Exercise 12

For column 'FlightNumber' check if the <u>FIRST VALUE (check the min index) is `None`</u>.
<br>If the value is 'None' set it to '1000', else print the FIRST VALUE.

In [None]:
# finding the First min index


0

In [None]:
# Checking if the first value is None


10045.0
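
The two cells above show only their outputs. One possible way to produce them is sketched below on a standalone frame named `flights` (a hypothetical name, holding only the `FlightNumber` values from Exercise 2). Note that pandas stores `None` in a numeric column as `NaN`, so the check uses `pd.isna` rather than `is None`.

```python
import pandas as pd

# Hypothetical standalone frame with the FlightNumber values from Exercise 2
flights = pd.DataFrame({"FlightNumber": [10045, None, None, None, 10085]})

# Find the smallest index label, then read the value stored there
first_label = flights.index.min()
first_value = flights.loc[first_label, "FlightNumber"]

# None becomes NaN in a numeric column, so test with pd.isna
if pd.isna(first_value):
    flights.loc[first_label, "FlightNumber"] = 1000
else:
    print(first_value)
```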


### Exercise 13

Use a loop to fill all the missing values.
Each missing value should be filled with the <u>value before + 10</u>.

Unnamed: 0,FlightNumber,Airline,Landed,From,To,delay_1,delay_2,delay_3
0,10045.0,KLM,True,London,Paris,23.0,47.0,
1,10055.0,El-Al,False,Madrid,Milan,,,
2,10065.0,British Airways,True,London,Stockholm,24.0,43.0,87.0
3,10075.0,El-Al,False,Budapest,Paris,13.0,,
4,10085.0,Swiss Air,False,Brussels,London,67.0,32.0,
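
A minimal sketch of such a loop, again on a hypothetical standalone frame `flights` that holds only the `FlightNumber` column. It walks the rows by position and, assuming the first value is present, fills each gap from the row directly before it.

```python
import pandas as pd

# Hypothetical standalone frame with the FlightNumber values from Exercise 2
flights = pd.DataFrame({"FlightNumber": [10045, None, None, None, 10085]})

# Start at the second row: the first row has no "value before" to build on
for pos in range(1, len(flights)):
    if pd.isna(flights["FlightNumber"].iloc[pos]):
        previous = flights["FlightNumber"].iloc[pos - 1]
        flights.loc[flights.index[pos], "FlightNumber"] = previous + 10
```

Because each filled value is written back before the next iteration, consecutive gaps build on one another (10055, then 10065, then 10075).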


### Exercise 14

Change the 'FlightNumber' column to data type `int`.

Unnamed: 0,FlightNumber,Airline,Landed,From,To,delay_1,delay_2,delay_3
0,10045,KLM,True,London,Paris,23.0,47.0,
1,10055,El-Al,False,Madrid,Milan,,,
2,10065,British Airways,True,London,Stockholm,24.0,43.0,87.0
3,10075,El-Al,False,Budapest,Paris,13.0,,
4,10085,Swiss Air,False,Brussels,London,67.0,32.0,
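
One way to do the conversion is `astype(int)`, sketched here on a hypothetical standalone frame `flights` that already has its gaps filled. The cast raises an error while any `NaN` remains, so it only works after the missing values from the previous exercise are gone.

```python
import pandas as pd

# Hypothetical standalone frame: FlightNumber after the gaps have been filled
flights = pd.DataFrame({"FlightNumber": [10045.0, 10055.0, 10065.0, 10075.0, 10085.0]})

# Cast the float column to integers; this fails if NaN values are still present
flights["FlightNumber"] = flights["FlightNumber"].astype(int)
```

If missing values had to be kept, the nullable `"Int64"` dtype would be an alternative to plain `int`.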
