<h3> Luis Garduno </h3>

Dataset: [__International Database (IDB)__](https://www.census.gov/data-tools/demo/idb/#/country?COUNTRY_YEAR=2022&COUNTRY_YR_ANIM=2022)

Question of Interest: Predict the population of Earth in 2122.

# Data Understanding

## Data Description

In [None]:
import numpy as np
import pandas as pd

# Load dataset into dataframe
df = pd.read_csv('https://raw.githubusercontent.com/luisegarduno/MachineLearning_Projects/master/data/idb5yr.all', delimiter='|', encoding='ISO-8859-1')

df.info()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Make year column easier to understand
df.rename(columns={'#YR':'YEAR'}, inplace=True)

# Keep only the year & population columns
df = df[['YEAR', 'POP']]

df.describe()


------------------------

## Data Quality

In [None]:
import missingno as mn

mn.matrix(df)

# Count unique values in column 'YEAR' of the dataframe
print('Number of unique values in column "YEAR" : ', df['YEAR'].nunique())


------------------------

## Cleaning the Dataset

In [None]:
# Group by year & get sum
df_yr = df.groupby(by='YEAR')
df_yr = df_yr['POP'].sum()

# Create a new dataframe with the yearly totals (1961 - 2021)
pop_sum = []
for i in range(1961, 2022):
    pop_sum.append(df_yr[i])
df_pop = pd.DataFrame({'YEAR': list(range(1961, 2022)), 'POP': pop_sum})
df = df_pop

print(f'\n--> Current Population (2021): {df["POP"][60]:,d}\n')
df.tail(5)

In [None]:
sns.set_style("darkgrid")
plt.subplots(figsize=(10,7))
ax = sns.scatterplot(data=df, x='YEAR', y='POP', color='blue')
ax.set_xlabel('Years', fontsize=16)
ax.set_ylabel('Population', fontsize=16)
ax.set_title('World Population (1961-2021)', fontsize=18)
plt.xlim(1960, 2022)

plt.show()

In [None]:
# Define X & y without modifying the dataframe (df and df_pop are the same object)
y = df_pop['POP'].to_numpy()
X = df_pop[['YEAR']].to_numpy()


----------------------


# Modeling

The optimal values of the regression weights have a closed-form solution, the normal equation:

$$ w = (X^TX)^{-1}X^Ty $$

where $X$ is the matrix of feature values with a bias column of ones stacked onto it.
For the population dataset, $X$ is built by stacking a column of ones in front of the `df_pop.YEAR` column, so $w_0$ is the intercept and $w_1$ is the slope.

$$ X=\begin{bmatrix}
        1 & 1961 \\
        \vdots & \vdots \\
        1 & 2021 \\
     \end{bmatrix}
$$
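
As a small sanity check of the formula above, the sketch below solves the normal equation on a toy dataset and compares it with `np.linalg.lstsq`. The toy data and the names `x_toy`, `y_toy`, `X_toy`, `w_normal` and `w_lstsq` are illustrative assumptions, not part of the notebook, and `lstsq` is shown only as an alternative solver. It may be worth considering here because raw year values around 2000 can leave $X^TX$ poorly conditioned.

```python
import numpy as np

# Hypothetical toy data lying exactly on the line y = 2 + 3x
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 2.0 + 3.0 * x_toy

# Design matrix: bias column of ones first, then the feature column
X_toy = np.column_stack((np.ones_like(x_toy), x_toy))

# Normal equation: w = (X^T X)^{-1} X^T y
w_normal = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ y_toy

# Alternative (assumption, not used in the notebook): least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print("normal equation:", w_normal)   # [intercept, slope]
print("lstsq          :", w_lstsq)
```

Both approaches should recover an intercept of 2 and a slope of 3 for this toy line, up to floating-point error.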

In [None]:
# Create a matrix full of ones & stack 2 matrices horizontally
X = np.hstack((np.ones((len(X), 1)), X))

# Calculate optimal values of the regression weights
w = np.linalg.inv(X.T @ X) @ X.T @ y

print("\n++++++++++++++ WEIGHTS +++++++++++++++++\n", pd.DataFrame(data=w))
diff = np.round(( (y - (abs(np.dot(X,w) - y))) / y ) * 100, 2)
print("\n============= TARGET PERCENT ACCURACY ===============\n", pd.DataFrame(data=diff))


---------------------------

To predict the model output $\hat{y}$ from $w$ and $X$, we use

$\hat{y}=w^TX^T$, where $\hat{y}$ is a row vector
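
The prediction formula can be checked on a tiny example. This is a minimal sketch with made-up numbers: `w_demo`, `X_demo` and `y_true_demo` are hypothetical and do not come from the population data. It shows that $w^TX^T$ gives the same values as the more common form $Xw$, and how the mean squared error follows from the predictions.

```python
import numpy as np

# Hypothetical weights [intercept, slope] and a two-row design matrix
w_demo = np.array([2.0, 3.0])
X_demo = np.array([[1.0, 0.0],    # bias, x = 0
                   [1.0, 4.0]])   # bias, x = 4

y_hat_row = w_demo.T @ X_demo.T   # w^T X^T, as in the formula above
y_hat_col = X_demo @ w_demo       # X w, the equivalent column form

# Hypothetical observed targets, used only to illustrate the MSE
y_true_demo = np.array([2.5, 13.0])
mse_demo = np.square(y_true_demo - y_hat_row).mean()

print(y_hat_row, y_hat_col, mse_demo)
```

For a 1-D weight array, `w_demo.T` is the same array as `w_demo`, so the transpose only matters when $w$ is stored as a column matrix.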

In [None]:
yHat_np = w.T @ X.T        # Shape : (1,61)
yHat_np = yHat_np.ravel()  # Shape : (61,)

MSE_np = (np.square(y - yHat_np)).mean()
print(f'MSE: {round(MSE_np):,d}')

In [None]:
X_new = np.array([[0], [2122]])
X_test = np.c_[np.ones((len(X_new), 1)), X_new]
y_test = X_test.dot(w)
print(f'\n--> ~Population (2122): {round(y_test[1]):,d}\n')

sns.set_style("darkgrid")
plt.subplots(figsize=(25,10))

plt.plot(X_new, y_test, "r-")  # "r-" already sets a solid red line
sns.scatterplot(data=df, x='YEAR', y=y, color='blue')
sns.scatterplot(x=X_new[1], y=y_test[1], s=100, marker="X", linewidth=1, edgecolor='k', color='gold')

plt.xlabel('Years', fontsize=20)
plt.ylabel('Population', fontsize=20)
plt.title('Population Prediction (1961-2122)', fontsize=20)
plt.axis([1960, 2124, 2000000000, 16500000000])
plt.legend(["Linear Regression", "Population", "Prediction"], prop={'size': 15})
plt.show()

-----------------------------

# Comparing Performance

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(X, y)
MSE_sk = mean_squared_error(y, reg.predict(X))
y_testsk = reg.predict(X_test)

print("******** Linear Equation ********")
# w[0] / reg.intercept_ is the intercept; w[1] / reg.coef_[1] is the slope
print("[Numpy] \th =", round(w[1],3), "* x + (" + str(round(w[0],5)) + ")")
print("[Sklearn]\th =", round(reg.coef_[1],3), "* x + (" + str(round(reg.intercept_,5)) + ")\n")

print("******** Mean Squared Error ********")
print(f'[Numpy] \tMSE: {round(MSE_np):,d}')
print(f'[Sklearn]\tMSE: {round(MSE_sk):,d}\n')

print("******** Population Prediction - 2122 ********")
print(f'[Numpy] \t {round(y_test[1]):,d}')
print(f'[Sklearn]\t {round(y_testsk[1]):,d}\n')


---------------------

#### References

U.S. Census Bureau. International Database (IDB). https://www.census.gov/data-tools/demo/idb/#/country?COUNTRY_YEAR=2022&COUNTRY_YR_ANIM=2022 (Accessed 01-22-2022)