# 75 pandas Exercises: Exercises 31 to 40

Exercises 31 to 40 from [here](https://www.machinelearningplus.com/python/101-pandas-exercises-python/). Each exercise includes the question, the input and the solution's code. Sometimes, alternative solutions and comments to better explain solutions/pandas functionality are offered.

Requirements: 
+ `pandas`
+ `numpy`

Happy Pandasing! 🐼

# Imports 

In [2]:
import numpy as np
import pandas as pd

---

## Exercises

### 🐼 Exercise 31

**Fill an intermittent time series so all missing dates show up with values of previous non-missing date.** `ser` has missing dates and values. Make all missing dates appear and fill up with value from previous date.

Input

In [3]:
ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))
print(ser)

2000-01-01     1.0
2000-01-03    10.0
2000-01-06     3.0
2000-01-08     NaN
dtype: float64


There are two approaches. The first generates a new daily index running from our first date to the last, applies it with `pd.Series.reindex` (which sets the newly added dates to NaN), and then fills the NaNs by copying the previous value. The second resamples our `pd.Series`, which creates the missing dates and fills them in one step. Let's try both and see that one of them is not like the other.

Generating the new index: 

In [6]:
new_index = pd.date_range(start=ser.index[0], end=ser.index[-1], freq='D') # D stands for daily frequency

Reindexing:

In [10]:
ser_reindex = ser.reindex(new_index, fill_value=np.nan)
print(ser_reindex)

2000-01-01     1.0
2000-01-02     NaN
2000-01-03    10.0
2000-01-04     NaN
2000-01-05     NaN
2000-01-06     3.0
2000-01-07     NaN
2000-01-08     NaN
Freq: D, dtype: float64


Now, let's fill NaNs!

In [24]:
ser_fixed = ser_reindex.ffill()  # forward fill: propagate the last valid value forward (fillna(method='pad') is deprecated in recent pandas)
print(ser_fixed)

2000-01-01     1.0
2000-01-02     1.0
2000-01-03    10.0
2000-01-04    10.0
2000-01-05    10.0
2000-01-06     3.0
2000-01-07     3.0
2000-01-08     3.0
Freq: D, dtype: float64


Done. Let's try it the other way. 

In [26]:
ser_fixed_alt = ser.resample(rule='D').ffill() # upsample to a daily frequency, then forward-fill the newly created dates (.pad() was removed in pandas 2.0)
print(ser_fixed_alt)

2000-01-01     1.0
2000-01-02     1.0
2000-01-03    10.0
2000-01-04    10.0
2000-01-05    10.0
2000-01-06     3.0
2000-01-07     3.0
2000-01-08     NaN
Freq: D, dtype: float64


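Okay, it seems we need one final operation: 2000-01-08 was already in the original index with a NaN value, and the resampler's forward fill only fills the dates created by the upsampling, not NaNs that were already present.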

In [27]:
ser_fixed_alt = ser_fixed_alt.ffill()  # fill the NaN that was already in the original series
print(ser_fixed_alt)

2000-01-01     1.0
2000-01-02     1.0
2000-01-03    10.0
2000-01-04    10.0
2000-01-05    10.0
2000-01-06     3.0
2000-01-07     3.0
2000-01-08     3.0
Freq: D, dtype: float64


Done. Quicker, _huh?_
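
As a third option, the reindex-then-fill steps above can be sketched in one line with `pd.Series.asfreq`, which inserts the missing dates as NaN before the forward fill runs. This is only a minimal sketch on the same input, not part of the original solution; the name `filled` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series(
    [1, 10, 3, np.nan],
    index=pd.to_datetime(["2000-01-01", "2000-01-03", "2000-01-06", "2000-01-08"]),
)

# asfreq("D") adds the missing days as NaN; ffill then fills every NaN,
# including the one that was already present on 2000-01-08.
filled = ser.asfreq("D").ffill()
print(filled)
```

Because the fill happens on a plain `pd.Series` rather than on the resampler, the pre-existing NaN is handled in the same pass and no extra step is needed.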

### 🐼 Exercise 32

**Compute the autocorrelations of a numeric series.** Compute autocorrelations for the first 10 lags of `ser`. Find out which lag has the largest correlation.

Input

In [34]:
ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))

So, `pandas` has built-in support for calculating the autocorrelation of a series: the Pearson correlation between the series and a copy of itself shifted by a given number of samples (the lag). `pd.Series.autocorr` takes that shift as its `lag` argument, so we can compute one value per lag.

In [35]:
lag_values = np.arange(0, 10)
auto_corr_vals = [ser.autocorr(lag=i) for i in lag_values]

Let's check out the values of the autocorrelations (ignoring the first, as the signal is exactly equal to itself when there's no lag involved):

In [39]:
print(auto_corr_vals[1:])

[-0.03755812497809473, -0.02109999814539558, 0.03181184882563077, -0.29807718748493656, -0.05921651437011084, 0.7923951272939062, 0.0012462196812533216, 0.2885403792390042, -0.1564077878210171]


In [40]:
print("The lag value with the highest autocorrelation is {}.".format(np.argmax(auto_corr_vals[1:])+1))

The lag value with the highest autocorrelation is 6.
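
To make the lagging explicit, here is a minimal sketch comparing `autocorr` with correlating the series against a shifted copy of itself by hand. The fixed seed and the names `rng`, `by_autocorr`, `by_shift` and `best_lag` are assumptions added for repeatability and illustration; the numbers will differ from the run shown above, which used unseeded random data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (an assumption) so runs are repeatable
ser = pd.Series(np.arange(20) + rng.normal(1, 10, 20))

lags = range(1, 11)  # the first 10 lags, skipping lag 0

# Built-in autocorrelation for each lag
by_autocorr = [ser.autocorr(lag=k) for k in lags]

# Same quantity by hand: correlate the series with a copy shifted by k samples
by_shift = [ser.corr(ser.shift(k)) for k in lags]

# Lag with the largest autocorrelation
best_lag = max(lags, key=lambda k: ser.autocorr(lag=k))
print(best_lag)
```

Starting the range at 1 avoids having to drop the lag-0 entry and offset the result of `np.argmax` afterwards.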


### 🐼 Exercise 33

**Import only every nth row from a csv file to create a dataframe.** Import every 50th row of the [Boston Housing](https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv) dataset as a dataframe.

Input

In [79]:
dataset_link = 'https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv'

The dataset is in the `.csv` format and `pandas` has a very competent `pd.read_csv` function, so let's bet on it.

In [102]:
data = pd.read_csv(dataset_link, chunksize=50)

The `chunksize` argument is typically used to load big files (which won't fit into memory). Instead of a dataframe, `pd.read_csv` then returns an iterator: not the actual data, but something that yields the next chunk whenever it is required. So, we now have the dataset in chunks of 50 rows, and we are only interested in the first row of each chunk. So, we progressively extract that.

In [103]:
rows = []  # list in which we collect the first row of each chunk

for chunk in data:
    rows.append(chunk.iloc[0])  # all columns of the first row of the chunk

data_filtered = pd.DataFrame(rows)  # build the dataframe once (DataFrame.append was removed in pandas 2.0)

In [105]:
data_filtered.head()

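|     | age  | b      | chas | crim    | dis    | indus | lstat | medv | nox   | ptratio | rad | rm    | tax   | zn   |
|-----|------|--------|------|---------|--------|-------|-------|------|-------|---------|-----|-------|-------|------|
| 0   | 65.2 | 396.9  | 0.0  | 0.00632 | 4.09   | 2.31  | 4.98  | 24.0 | 0.538 | 15.3    | 1.0 | 6.575 | 296.0 | 18.0 |
| 50  | 45.7 | 395.56 | 0.0  | 0.08873 | 6.8147 | 5.64  | 13.45 | 19.7 | 0.439 | 16.8    | 4.0 | 5.963 | 243.0 | 21.0 |
| 100 | 79.9 | 394.76 | 0.0  | 0.14866 | 2.7778 | 8.56  | 9.42  | 27.5 | 0.52  | 20.9    | 5.0 | 6.727 | 384.0 | 0.0  |
| 150 | 97.3 | 372.8  | 0.0  | 1.6566  | 1.618  | 19.58 | 14.1  | 21.5 | 0.871 | 14.7    | 5.0 | 6.122 | 403.0 | 0.0  |
| 200 | 13.9 | 384.3  | 0.0  | 0.01778 | 7.6534 | 1.47  | 4.45  | 32.9 | 0.403 | 17.0    | 3.0 | 7.135 | 402.0 | 95.0 |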


Done!
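
An alternative worth knowing is the `skiprows` argument of `pd.read_csv`, which accepts a callable and lets us drop the unwanted rows while the file is being parsed. The sketch below is not the original solution; it uses a small in-memory CSV (`csv_text`, a hypothetical stand-in for the real file) so that it runs without a network connection.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: a header plus 120 data rows.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(120))

n = 50

# The callable receives file line numbers: line 0 is the header and
# data row k sits on file line k + 1. Returning True skips the line.
every_nth = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda i: i > 0 and (i - 1) % n != 0,
)
print(every_nth)
```

Unlike the chunked version, the resulting dataframe gets a fresh 0, 1, 2, ... index rather than the original row positions, and the columns keep their original dtypes because no row is converted to a `pd.Series` along the way.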

### 🐼 Exercise 34