
* Computing Platforms: Set up the Workspace for Machine Learning Projects. https://ms.pubpub.org/pub/computing
* Machine Learning for Predictions. https://ms.pubpub.org/pub/ml-prediction
* Machine Learning Packages: https://scikit-learn.org/stable/


# Part I: Import and Inspect Data

In [1]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/Rising-Stars-by-Sunshine/Final-Project-Isabella/main/data/Queried_Data/Queried_data.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Year,Country,Emigrants
0,0,1990,Latin America and the Caribbean,15273399
1,1,1995,Latin America and the Caribbean,19669704
2,2,2000,Latin America and the Caribbean,24628700
3,3,2005,Latin America and the Caribbean,29338206
4,4,2010,Latin America and the Caribbean,34637650
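The `Unnamed: 0` column in the output above usually appears when a CSV was saved together with its row index and then read back without naming that index. A minimal sketch of how `index_col=0` avoids the extra column, assuming an inline sample (`csv_text`, copied from the rows shown above) rather than the remote file:

```python
import io
import pandas as pd

# Inline sample mirroring the first rows shown above (not the full remote file).
csv_text = """,Year,Country,Emigrants
0,1990,Latin America and the Caribbean,15273399
1,1995,Latin America and the Caribbean,19669704
2,2000,Latin America and the Caribbean,24628700
"""

# index_col=0 treats the unnamed first column as the row index
# instead of loading it as a data column called "Unnamed: 0".
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.columns.tolist())
sample.info()
```

The same argument should apply when reading the real file, although that was not checked here.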


# Part II: Prepare the Y Variable for Regression

## 2.1. Write Functions to Calculate the Y Variable for Regression

*(Skip this step if the Y variable already exists.)*

In [3]:
df['theta'] = df['Emigrants']
df.head()

Unnamed: 0.1,Unnamed: 0,Year,Country,Emigrants,theta
0,0,1990,Latin America and the Caribbean,15273399,15273399
1,1,1995,Latin America and the Caribbean,19669704,19669704
2,2,2000,Latin America and the Caribbean,24628700,24628700
3,3,2005,Latin America and the Caribbean,29338206,29338206
4,4,2010,Latin America and the Caribbean,34637650,34637650


## 2.2. Make Sure that the Data Type of Y is "numeric"

In [4]:
df.dtypes

Unnamed: 0     int64
Year           int64
Country       object
Emigrants      int64
theta          int64
dtype: object

In [5]:
df['theta'] = pd.to_numeric(df['theta'])
df.dtypes

Unnamed: 0     int64
Year           int64
Country       object
Emigrants      int64
theta          int64
dtype: object
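Here `theta` was already `int64`, so `pd.to_numeric` leaves it unchanged. The call matters when a column arrives as text. A small sketch, assuming a hypothetical string column `raw` with one unparseable entry, showing how `errors="coerce"` turns bad values into `NaN` instead of raising:

```python
import pandas as pd

# Hypothetical column that was read in as text, with one bad entry.
raw = pd.Series(["15273399", "19669704", "n/a"])

# errors="coerce" converts values that cannot be parsed into NaN.
clean = pd.to_numeric(raw, errors="coerce")

print(clean.dtype)         # numeric dtype (float, because of the NaN)
print(clean.isna().sum())  # number of values that failed to parse
```

Counting the `NaN` values afterwards is a quick way to see how many rows need attention before modeling.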

# Part III: Prepare the Y variable for Classification

reference:

https://datatofish.com/if-condition-in-pandas-dataframe/

In [8]:
#@title Define the Congestion Threshold
cut = 20000000 #@param {type:"number"}


## 3.1. Method 1: If function

In [9]:
df['congested'] = df['theta'] >= cut
df.head()

Unnamed: 0.1,Unnamed: 0,Year,Country,Emigrants,theta,congested
0,0,1990,Latin America and the Caribbean,15273399,15273399,False
1,1,1995,Latin America and the Caribbean,19669704,19669704,False
2,2,2000,Latin America and the Caribbean,24628700,24628700,True
3,3,2005,Latin America and the Caribbean,29338206,29338206,True
4,4,2010,Latin America and the Caribbean,34637650,34637650,True


In [10]:
# recode the flag as 1 (at or above the threshold) and 0 (below it)
df.loc[df['theta'] >= cut, 'congested'] = 1
df.loc[df['theta'] < cut, 'congested'] = 0
df.head()

Unnamed: 0.1,Unnamed: 0,Year,Country,Emigrants,theta,congested
0,0,1990,Latin America and the Caribbean,15273399,15273399,0
1,1,1995,Latin America and the Caribbean,19669704,19669704,0
2,2,2000,Latin America and the Caribbean,24628700,24628700,1
3,3,2005,Latin America and the Caribbean,29338206,29338206,1
4,4,2010,Latin America and the Caribbean,34637650,34637650,1
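Writing 1 and 0 into a column that already holds booleans can leave it with a non-integer dtype, depending on the pandas version. One alternative, sketched here on a small stand-in frame (`demo` is hypothetical; `cut` is the threshold defined above), is to build the comparison and cast it in a single step:

```python
import pandas as pd

cut = 20000000  # same threshold as defined above

# Stand-in frame with the first three theta values from the output above.
demo = pd.DataFrame({"theta": [15273399, 19669704, 24628700]})

# The comparison yields booleans; astype(int) maps False -> 0 and True -> 1.
demo["congested"] = (demo["theta"] >= cut).astype(int)

print(demo)
print(demo["congested"].dtype)
```

This produces the same 0/1 labels as the two `df.loc` assignments, with an integer column as the result.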


# Part IV: Create the X variables

## 4.1. Shift the Y to get past values

reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html

In [11]:
# generate a new variable holding the previous observation of the Y variable for regression
df['theta_past'] = df['theta'].shift(1)
df.head()

Unnamed: 0.1,Unnamed: 0,Year,Country,Emigrants,theta,congested,theta_past
0,0,1990,Latin America and the Caribbean,15273399,15273399,0,
1,1,1995,Latin America and the Caribbean,19669704,19669704,0,15273399.0
2,2,2000,Latin America and the Caribbean,24628700,24628700,1,19669704.0
3,3,2005,Latin America and the Caribbean,29338206,29338206,1,24628700.0
4,4,2010,Latin America and the Caribbean,34637650,34637650,1,29338206.0
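The rows shown here cover a single region, so a plain `shift(1)` is enough. If the data held several regions stacked in one frame, an unqualified shift would carry the last value of one region into the first row of the next. A sketch under that assumption, using a hypothetical two-region frame `panel`:

```python
import pandas as pd

# Hypothetical panel with two regions; values are illustrative only.
panel = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [1990, 1995, 2000, 1990, 1995, 2000],
    "theta": [10, 20, 30, 100, 200, 300],
})

# Sort so that each region's rows are in time order, then shift within each region.
panel = panel.sort_values(["Country", "Year"])
panel["theta_past"] = panel.groupby("Country")["theta"].shift(1)

print(panel)
```

Each region's first row gets `NaN`, because no earlier observation exists for it.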


## 4.2. Calculate the Moving Averages

references: 

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html

https://towardsdatascience.com/moving-averages-in-python-16170e20f6c

In [16]:
#@title Define the Window
window = 2 #@param {type:"number"}


In [17]:
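# the column name says "ma10", but the window length is whatever `window` is set to above (2 here)
df['theta_past_ma10'] = df['theta_past'].rolling(window=window, min_periods=1).mean()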
df.head(20)

Unnamed: 0.1,Unnamed: 0,Year,Country,Emigrants,theta,congested,theta_past,theta_past_ma10
0,0,1990,Latin America and the Caribbean,15273399,15273399,0,,
1,1,1995,Latin America and the Caribbean,19669704,19669704,0,15273399.0,15273399.0
2,2,2000,Latin America and the Caribbean,24628700,24628700,1,19669704.0,17471551.5
3,3,2005,Latin America and the Caribbean,29338206,29338206,1,24628700.0,22149202.0
4,4,2010,Latin America and the Caribbean,34637650,34637650,1,29338206.0,26983453.0
5,5,2015,Latin America and the Caribbean,36206000,36206000,1,34637650.0,31987928.0
6,6,2020,Latin America and the Caribbean,42890481,42890481,1,36206000.0,35421825.0
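To see where the numbers in the last column come from, the calculation can be reproduced on the seven `theta` values alone. This is a minimal sketch with a window of 2 and `min_periods=1`, matching the settings above; the names `theta`, `theta_past`, and `ma` are local to the example:

```python
import pandas as pd

# The seven Emigrants values shown above (1990 to 2020).
theta = pd.Series([15273399, 19669704, 24628700, 29338206,
                   34637650, 36206000, 42890481])

theta_past = theta.shift(1)  # previous observation; the first row becomes NaN

# Window of 2 with min_periods=1: a row is averaged as soon as at least
# one non-missing value is inside its window.
ma = theta_past.rolling(window=2, min_periods=1).mean()

print(pd.DataFrame({"theta": theta, "theta_past": theta_past, "ma": ma}))
```

Row 0 stays `NaN` because its window has no valid value. Row 1 equals the single available value, 15273399. Row 2 is (15273399 + 19669704) / 2 = 17471551.5, and later rows follow the same pattern.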


# Part V: Train and Test Split

reference:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html

In [18]:
from sklearn.model_selection import TimeSeriesSplit
tss = TimeSeriesSplit()
print(tss)

TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None)


In [72]:
# change the split parameters: 2 folds, each with 3 test rows
tss = TimeSeriesSplit(gap=0, max_train_size=None, n_splits=2, test_size=3)

In [73]:
for train_idx, test_idx in tss.split(df):
    print("TRAIN:", train_idx, "TEST:", test_idx)

TRAIN: [0] TEST: [1 2 3]
TRAIN: [0 1 2 3] TEST: [4 5 6]


In [74]:
train_idx

array([0, 1, 2, 3])

In [75]:
test_idx

array([4, 5, 6])
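After the loop finishes, `train_idx` and `test_idx` hold only the last fold, which is why the two cells above show `[0, 1, 2, 3]` and `[4, 5, 6]`. A sketch that keeps every fold in a list and then selects the last one explicitly, assuming seven rows as in the data above (`folds`, `final_train_idx`, and `final_test_idx` are names introduced for this example):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples = 7  # the DataFrame above has seven rows (1990 to 2020)

tss = TimeSeriesSplit(n_splits=2, test_size=3)

# Materialize all folds so that none is lost when the loop ends.
folds = list(tss.split(np.arange(n_samples)))
for i, (tr, te) in enumerate(folds):
    print(f"fold {i}: train={tr.tolist()} test={te.tolist()}")

# Pick the last fold explicitly: it has the largest training window.
final_train_idx, final_test_idx = folds[-1]
```

With `test_size=3` and two splits, the test blocks occupy the last six rows, so the first fold trains on a single row.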

In [76]:
# TimeSeriesSplit returns positional indices, so select rows by position
train_df = df.iloc[train_idx]
test_df = df.iloc[test_idx]

In [77]:
train_df.head()

Unnamed: 0.1,Unnamed: 0,Year,Country,Emigrants,theta,congested,theta_past,theta_past_ma10
0,0,1990,Latin America and the Caribbean,15273399,15273399,0,,
1,1,1995,Latin America and the Caribbean,19669704,19669704,0,15273399.0,15273399.0
2,2,2000,Latin America and the Caribbean,24628700,24628700,1,19669704.0,17471551.5
3,3,2005,Latin America and the Caribbean,29338206,29338206,1,24628700.0,22149202.0


In [78]:
test_df.head()

Unnamed: 0.1,Unnamed: 0,Year,Country,Emigrants,theta,congested,theta_past,theta_past_ma10
4,4,2010,Latin America and the Caribbean,34637650,34637650,1,29338206.0,26983453.0
5,5,2015,Latin America and the Caribbean,36206000,36206000,1,34637650.0,31987928.0
6,6,2020,Latin America and the Caribbean,42890481,42890481,1,36206000.0,35421825.0
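The first training row has no value for `theta_past_ma10`, and many scikit-learn estimators reject inputs that contain `NaN`. One possible way to handle this, shown on a stand-in copy of the training rows (`train_demo` is hypothetical), is to drop rows where the feature is missing:

```python
import numpy as np
import pandas as pd

# Stand-in for the training rows above, limited to the modeling columns.
train_demo = pd.DataFrame({
    "theta": [15273399, 19669704, 24628700, 29338206],
    "theta_past_ma10": [np.nan, 15273399.0, 17471551.5, 22149202.0],
})

# Keep only rows where the feature is present.
train_clean = train_demo.dropna(subset=["theta_past_ma10"])

print(train_clean)
```

Dropping is only one option; with so few rows, losing an observation is costly, so imputing the value may be worth considering instead.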


# Part VI: Prepare the Train and Test Data for Classification and Regression

## 6.1. Classification

### 6.1.1. Define the columns (Y, X) for Classification

In [79]:
cols_C = ['congested','theta_past_ma10']

### 6.1.2. Define the Data Frame of Train and Test Data for Classification

In [80]:
df_C_train = train_df[cols_C]
df_C_test = test_df[cols_C]

### 6.1.3. Export the Train and Test Data for Classification

In [81]:
df_C_train.head()

Unnamed: 0,congested,theta_past_ma10
0,0,
1,0,15273399.0
2,1,17471551.5
3,1,22149202.0


In [82]:
df_C_train.to_csv('Classification_Train.csv')

In [83]:
df_C_test.head()

Unnamed: 0,congested,theta_past_ma10
4,1,26983453.0
5,1,31987928.0
6,1,35421825.0


In [84]:
df_C_test.to_csv('Classification_Test.csv')

## 6.2. Regression

### 6.2.1. Define the columns (Y, X) for Regression

In [85]:
cols_R = ['theta','theta_past_ma10']

### 6.2.2. Define the Data Frame of Train and Test Data for Regression

In [86]:
df_R_train = train_df[cols_R]
df_R_test = test_df[cols_R]

### 6.2.3. Export the Train and Test Data for Regression

In [87]:
df_R_train.head()

Unnamed: 0,theta,theta_past_ma10
0,15273399,
1,19669704,15273399.0
2,24628700,17471551.5
3,29338206,22149202.0


In [88]:
df_R_train.to_csv('Regression_Train.csv')

In [89]:
df_R_test.head()

Unnamed: 0,theta,theta_past_ma10
4,34637650,26983453.0
5,36206000,31987928.0
6,42890481,35421825.0


In [90]:
df_R_test.to_csv('Regression_Test.csv')
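Because `to_csv` writes the row index by default, the exported files carry an unnamed first column. A sketch of how a later notebook might load one of these files and separate Y from X, assuming a temporary directory and a stand-in frame `df_R_demo` built from the test rows shown above:

```python
import os
import tempfile
import pandas as pd

# Stand-in for df_R_test, using the three test rows shown above.
df_R_demo = pd.DataFrame(
    {"theta": [34637650, 36206000, 42890481],
     "theta_past_ma10": [26983453.0, 31987928.0, 35421825.0]},
    index=[4, 5, 6],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Regression_Test.csv")
    df_R_demo.to_csv(path)                      # index is written as the first column
    loaded = pd.read_csv(path, index_col=0)     # restore it as the index on load

y = loaded["theta"]              # target as a Series
X = loaded[["theta_past_ma10"]]  # features as a 2-D DataFrame

print(X.shape, y.shape)
```

Selecting `X` with double brackets keeps it two-dimensional, which is the shape scikit-learn estimators expect for features.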