# Population Prediction

### Luis Garduno

## 1. Business Understanding

#### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <u><code>About the datasets:</code></u>
    
    To generate the most accurate prediction of Earth's population in 2122, it is
    necessary to have population data for as many previous years as possible.
    
    The dataset used will be a combination of 2 different datasets. The first
    dataset contains the global population data from 1951 to 2020, whereas the second
    dataset contains each country's population data from 2021 to the present.

-------------------------------------
    
Datasets [Kaggle]: 
1. [__World Population by Year (1951 - 2020)__](https://www.kaggle.com/sansuthi/world-population-by-year)
2. [__World Population (2021 - Present)__](https://www.kaggle.com/rsrishav/world-population)
3. [__International Database (IDB)__](https://www2.census.gov/programs-surveys/international-programs/about/idb/idbzip.zip)

Question Of Interest : Predict the population of earth in 2122.
    
-------------------------------------

## 2. Data Understanding

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.1 Data Description

In [113]:
import numpy as np
import pandas as pd

# Load each of the datasets into a dataframe

print("*********** Dataset #1 (1951 - 2020) ***********\n")
df_1 = pd.read_csv('https://raw.githubusercontent.com/luisegarduno/MachineLearning_Projects/master/data/world-population_1.csv')
print("--> Columns:", df_1.columns.values)
#print(df_1.info())

print("\n")

print("*********** Dataset #2 (2021 - Present) ***********\n")
df_2 = pd.read_csv('https://raw.githubusercontent.com/luisegarduno/MachineLearning_Projects/master/data/world-population_2.csv')
print("--> Columns:", df_2.columns.values)
#print(df_2.info())

print("\n")

print("*********** Dataset #3 (1951 - Present) ***********\n")
df_3 = pd.read_csv('https://raw.githubusercontent.com/luisegarduno/MachineLearning_Projects/master/data/idb5yr.all', delimiter='|', encoding='ISO-8859-1')
print("--> Columns:", df_3.columns.values)

*********** Dataset #1 (1951 - 2020) ***********

--> Columns: ['Year' 'Population' 'ChangePerc' 'NetChange' 'Density' 'Urban'
 'UrbanPerc']


*********** Dataset #2 (2021 - Present) ***********

--> Columns: ['iso_code' 'country' '2021_last_updated' '2020_population' 'area'
 'density_sq_km' 'growth_rate' 'world_%' 'rank']


*********** Dataset #3 (1951 - Present) ***********

--> Columns: ['#YR' 'TFR' 'SRB' 'RNI' 'POP95_99' 'POP90_94' 'POP85_89' 'POP80_84'
 'POP75_79' 'POP70_74' 'POP65_69' 'POP60_64' 'POP5_9' 'POP55_59'
 'POP50_54' 'POP45_49' 'POP40_44' 'POP35_39' 'POP30_34' 'POP25_29'
 'POP20_24' 'POP15_19' 'POP10_14' 'POP100_' 'POP0_4' 'POP' 'NMR' 'NAME'
 'MR1_4' 'MR0_4' 'MPOP95_99' 'MPOP90_94' 'MPOP85_89' 'MPOP80_84'
 'MPOP75_79' 'MPOP70_74' 'MPOP65_69' 'MPOP60_64' 'MPOP5_9' 'MPOP55_59'
 'MPOP50_54' 'MPOP45_49' 'MPOP40_44' 'MPOP35_39' 'MPOP30_34' 'MPOP25_29'
 'MPOP20_24' 'MPOP15_19' 'MPOP10_14' 'MPOP100_' 'MPOP0_4' 'MPOP' 'MMR1_4'
 'MMR0_4' 'IMR_M' 'IMR_F' 'IMR' 'GRR' 'GR' 'FPOP95_99

---------------------------------

Printing the column names of each dataframe shows that there are a
total of 7 attributes in the first dataset and 9 in the second.

Attributes include:
- Description

Below is a brief description of some of the key attributes.

In [176]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Make the year column easier to understand
df_3 = df_3.rename(columns={'#YR': 'YEAR'})

# Keep only the year & population columns
df_3 = df_3[['YEAR', 'POP']]

# Group by year & sum the population across all rows for that year
df = df_3.groupby(by='YEAR')['POP'].sum()

# Finally, create a new dataframe with the yearly totals for 1951 - 2022
years = list(range(1951, 2023))
df_pop = pd.DataFrame({'YEAR': years, 'POP': df.loc[years].values})

print("Current:", df_pop['POP'].iloc[-1])
df_pop.tail(5)

Current: 7905336896


|    | YEAR | POP        |
| -- | ---- | ---------- |
| 67 | 2018 | 7597066210 |
| 68 | 2019 | 7676686052 |
| 69 | 2020 | 7756873419 |
| 70 | 2021 | 7831718605 |
| 71 | 2022 | 7905336896 |


| Variable | Description | Type | Range |
| -------- | ----------- | ---- | ----- |
| A        | B           | C    | D     |   

The population totals computed above match exactly the numbers shown on the [__IDB web tool__](https://www.census.gov/data-tools/demo/idb/#/country?COUNTRY_YEAR=2022&COUNTRY_YR_ANIM=2022).

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2 Normalizing the Dataset

In [151]:
# The population columns in dataset #2 are comma-formatted strings,
# so strip the commas before converting to integers and summing
total_2020 = df_2['2020_population'].str.replace(',', '').astype('int64').sum()
total_2021 = df_2['2021_last_updated'].str.replace(',', '').astype('int64').sum()

print("Total - 2020:", total_2020)
print("Total - 2021:", total_2021)
    

Total - 2020: 7697427912
Total - 2021: 7817937695
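
To see how closely the two sources agree, the totals printed above can be compared with the IDB totals for the same years from section 2.1. The snippet below is a minimal sketch: the four numbers are copied from the outputs in this notebook, and `relative_gap` is a hypothetical helper introduced only for illustration.

```python
# Totals from dataset #2 (printed above) and from the IDB table in section 2.1
dataset2_totals = {2020: 7_697_427_912, 2021: 7_817_937_695}
idb_totals = {2020: 7_756_873_419, 2021: 7_831_718_605}


def relative_gap(a, b):
    """Return |a - b| as a fraction of b (hypothetical helper)."""
    return abs(a - b) / b


# Relative difference between the two sources, per year
gaps = {
    year: relative_gap(dataset2_totals[year], idb_totals[year])
    for year in dataset2_totals
}

for year, gap in gaps.items():
    print(f"{year}: {gap:.2%} difference between dataset #2 and IDB")
```

Both gaps are under 1%, so the sources differ slightly but are on the same scale. Any offset like this would need to be considered before stitching the series together.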




---------------

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.3 Data Quality

Using the `missingno` package, we can additionally confirm that the data is complete and that there are no missing entries in the dataset. If there were missing data, we could impute the missing values using k-nearest neighbors. However, if an instance were missing a majority of its attributes, it would be removed from the dataset.

The number of unique values in the column " " is printed to verify that all instances
are weighted equally.
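
The k-nearest-neighbor imputation mentioned above is not needed for this dataset, but a minimal sketch of how it could look is shown below. It assumes scikit-learn's `KNNImputer`; the `toy` dataframe and its values are made up purely for illustration and are not taken from any of the datasets.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame with one missing population value (not real data)
toy = pd.DataFrame({
    'YEAR': [2000, 2001, 2002, 2003, 2004],
    'POP': [100.0, 110.0, np.nan, 130.0, 140.0],
})

# Count the missing entries per column (a numeric complement to the missingno plot)
missing_counts = toy.isnull().sum()
print(missing_counts)

# Replace each missing value with the mean of that column over the 2 nearest rows.
# Distances are computed on the observed columns only, so here the neighbors of
# the 2002 row are chosen by YEAR (2001 and 2003).
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

In this toy case the gap for 2002 is filled with the average of the 2001 and 2003 values. With real data, the choice of `n_neighbors` and of which columns take part in the distance would need some experimentation.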

In [None]:
import missingno as mn

# df_pop is the YEAR/POP dataframe built in section 2.1
mn.matrix(df_pop)

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.4 Cleaning the Dataset

In [None]:
# python

# Given as to how



del df['']


-------------------

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.5 Creating Training & Test Data
Using Scikit-learn's [cross-validation modules](https://scikit-learn.org/stable/modules/cross_validation.html) we are able to split our dataset for training and testing purposes.

In [None]:
from sklearn.model_selection import train_test_split

# Create the X data & y target arrays
y = df[''].values
# del df['']
X = df.to_numpy()


# Divide the data: 80% Training & 20% Testing.  
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

print("Training Set", "\n   - Data Shape:",X_train.shape,"\n   - Target Shape:",y_train.shape)
print("\nTesting Set","\n   - Data Shape:",X_test.shape ,"\n   - Target Shape:",y_test.shape)

-------------------

We split the dataset so that 80% is used for training and 20% for testing. The 80/20 split is appropriate for
this dataset because, recall, the end goal is for users to be able to estimate Earth's population 100 years from now.
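
As a concrete illustration of the 80/20 split, the sketch below applies `train_test_split` to a yearly series shaped like `df_pop`. It assumes that `YEAR` is the only feature and `POP` is the target, which the notebook has not yet settled on. The target values are evenly spaced placeholders, not real population figures.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: YEAR is the single feature and POP is the target
years = np.arange(1951, 2023)              # 72 yearly observations, as in df_pop
X = years.reshape(-1, 1)                   # scikit-learn expects a 2-D feature array
y = np.linspace(2.5e9, 7.9e9, len(years))  # placeholder targets, not real data

# Divide the data: 80% training & 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0
)

print("Training Set:", X_train.shape, y_train.shape)
print("Testing Set: ", X_test.shape, y_test.shape)
```

Because the goal is to extrapolate far beyond the observed years, it may also be worth trying `shuffle=False`, which keeps the rows in chronological order and holds out the most recent years for testing. That choice is a suggestion rather than something this notebook has evaluated.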


--------------------

## 3. Modeling

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.1 Model

In [None]:
# python


----------------------

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.2 Custom Classifier Training


In [None]:
# Visualize performance


------------------------

## 4. Deployment




---------------------

#### References

Worldometer. World Population by Year. https://www.worldometers.info/world-population/world-population-by-year/ (Accessed 01-22-2022)

Scikit-learn. Cross-validation. https://scikit-learn.org/stable/modules/cross_validation.html (Accessed 01-22-2022)