## setup

In [None]:
import pandas as pd

The dataset comes from Kaggle. Its columns are described below:

1. **Year**: The year the population figures refer to, from 1955 to 2024. In the rows displayed further down, entries before 2020 are five years apart rather than annual.

2. **Population**: The total estimated world population for the given year. The values are given in absolute numbers, typically formatted with commas to separate thousands, millions, etc.

3. **Yearly % Change**: The percentage increase or decrease in the world population compared to the previous year. This column shows how the global population has grown or shrunk over time.

4. **Yearly Change**: This column represents the absolute change in population from the previous year, expressed in terms of the number of people added or lost.

5. **Median Age**: The median age of the world population for that year. This value reflects the point where half the population is younger, and half is older, providing insights into the age distribution of the global population.

6. **Fertility Rate**: This column indicates the average number of children born per woman in the given year. It provides an idea of birth rates and population growth potential.

7. **Density (P/Km²)**: The population density, expressed as the number of people per square kilometer. It provides a measure of how densely populated the Earth is on average during that year.


### What is Root Mean Squared Error (RMSE)?

Root Mean Squared Error (RMSE) is a metric that measures the average magnitude of the error between predicted values and actual values. It calculates the square root of the average squared differences between the predicted values (\(\hat{y}_i\)) and the actual values (\(y_i\)):

\[
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\]

Where:
- \( n \) is the number of data points.
- \( y_i \) is the actual population value.
- \( \hat{y}_i \) is the predicted population value by the model.
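
As a minimal sketch of the formula above, RMSE can be computed in a few lines of NumPy. The function name `rmse` and the population figures are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the differences, average them, then take the square root.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical actual and predicted populations (in people).
actual = [7.0e9, 7.5e9, 8.0e9]
predicted = [7.1e9, 7.4e9, 8.2e9]

example_rmse = rmse(actual, predicted)
print(f"RMSE: {example_rmse:,.0f} people")
```

The result is expressed in people, the same unit as the inputs.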

### Why RMSE is a Good Choice for Evaluating RNNs in Global Population Prediction:

1. **Interpretability**:
   - RMSE is measured in the same units as the target variable, in this case, population. This makes the metric easy to interpret: it tells you roughly how many people a typical prediction is off by, with large misses weighted more heavily than in a plain average of absolute errors.
   
2. **Sensitivity to Large Errors**:
   - RMSE gives higher weight to large errors because the errors are squared before averaging. This is beneficial when predicting population growth because large errors in population predictions are significant and need to be penalized more severely. For example, underestimating or overestimating by millions could have a major impact.

3. **Captures Overall Fit**:
   - RMSE considers the overall accuracy of the model by measuring the average magnitude of error across all data points. This is useful when predicting global population growth trends over a long time series, as it helps to ensure the model is making accurate predictions on average.

4. **Small Errors Count Less**:
   - Squaring shrinks the contribution of small errors relative to large ones, so RMSE is dominated by the biggest deviations. This matters for population predictions, where occasional outliers can have a considerable impact (e.g., periods of sudden population change due to events like pandemics, wars, or baby booms).

5. **Widely Used in Time Series Forecasting**:
   - RMSE is one of the most common metrics in time series forecasting, making it easy to compare against other models or baselines when predicting population growth.

In the context of population prediction using an RNN (here, an LSTM), RMSE summarizes how well the model captures the overall trend and makes large deviations visible. It is not robust in the statistical sense, though: a single large error can dominate the score.
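
The sensitivity to large errors described in the list above can be illustrated by comparing RMSE with mean absolute error (MAE) on two sets of errors that have the same MAE. This is a sketch with made-up error values (in millions of people); MAE is brought in only as a point of comparison and is not used elsewhere in this notebook.

```python
import numpy as np

def rmse_from_errors(errors):
    """RMSE computed directly from an array of prediction errors."""
    return float(np.sqrt(np.mean(np.square(errors))))

def mae_from_errors(errors):
    """Mean absolute error computed from an array of prediction errors."""
    return float(np.mean(np.abs(errors)))

# Hypothetical errors, in millions of people.
even_errors = np.array([10.0, 10.0, 10.0, 10.0])  # every prediction off by 10
one_large = np.array([0.0, 0.0, 0.0, 40.0])       # one prediction off by 40

for name, errs in [("even", even_errors), ("one large", one_large)]:
    print(name, "MAE:", mae_from_errors(errs), "RMSE:", rmse_from_errors(errs))
```

Both error sets have an MAE of 10, but the RMSE doubles from 10 to 20 when the same total error is concentrated in a single prediction.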

## code

In [None]:
df = pd.read_csv("/content/world_population_data_1955-2024.csv")
df

Unnamed: 0,Year,Population,Yearly % Change,Yearly Change,Median Age,Fertility Rate,Density (P/Km²)
0,2024,8118835999,0.91 %,73524552,30.7,2.31,55
1,2023,8045311447,0.88 %,70206291,30.5,2.31,54
2,2022,7975105156,0.83 %,65810005,30.2,2.31,54
3,2021,7909295151,0.87 %,68342271,30.0,2.32,53
4,2020,7840952880,0.98 %,76001848,29.7,2.35,53
5,2015,7426597537,1.23 %,88198886,28.0,2.52,50
6,2010,6985603105,1.27 %,85485397,27.0,2.59,47
7,2005,6558176119,1.30 %,81855429,26.0,2.62,44
8,2000,6148898975,1.37 %,81135904,25.0,2.73,41
9,1995,5743219454,1.56 %,85408718,24.0,2.88,39


In [None]:
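# Strip thousands separators and percent signs, then convert to float.
# astype(str) keeps the .str accessor valid even if pandas already parsed a column as numeric.
df['Population'] = df['Population'].astype(str).str.replace(',', '', regex=False).astype(float)
df['Yearly % Change'] = df['Yearly % Change'].astype(str).str.replace('%', '', regex=False).astype(float)
df['Yearly Change'] = df['Yearly Change'].astype(str).str.replace(',', '', regex=False).astype(float)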

In [None]:
df

Unnamed: 0,Year,Population,Yearly % Change,Yearly Change,Median Age,Fertility Rate,Density (P/Km²)
0,2024,8118836000.0,0.91,73524552.0,30.7,2.31,55
1,2023,8045311000.0,0.88,70206291.0,30.5,2.31,54
2,2022,7975105000.0,0.83,65810005.0,30.2,2.31,54
3,2021,7909295000.0,0.87,68342271.0,30.0,2.32,53
4,2020,7840953000.0,0.98,76001848.0,29.7,2.35,53
5,2015,7426598000.0,1.23,88198886.0,28.0,2.52,50
6,2010,6985603000.0,1.27,85485397.0,27.0,2.59,47
7,2005,6558176000.0,1.3,81855429.0,26.0,2.62,44
8,2000,6148899000.0,1.37,81135904.0,25.0,2.73,41
9,1995,5743219000.0,1.56,85408718.0,24.0,2.88,39


In [None]:
# Drop the rows for 2024 and 2023 (index labels 0 and 1).
df = df.drop([0, 1])

In [None]:
from sklearn.preprocessing import MinMaxScaler

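# Scale each feature to [0, 1]. Population is listed first, so it is column 0 of scaled_data.
feature_columns = ['Population', 'Yearly % Change', 'Yearly Change', 'Median Age', 'Fertility Rate', 'Density (P/Km²)']
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df[feature_columns])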

In [None]:
import numpy as np

In [None]:
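def create_sequences_multifeature(data, sequence_length):
    """Build sliding windows over the rows of `data`.

    Each input is `sequence_length` consecutive rows (all features); the
    target is column 0 (Population) of the row that follows the window.
    """
    sequences = []
    targets = []
    for i in range(len(data) - sequence_length):
        seq = data[i:i+sequence_length]
        target = data[i+sequence_length, 0]
        sequences.append(seq)
        targets.append(target)
    return np.array(sequences), np.array(targets)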
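
To make the windowing concrete, here is a small sketch that runs the same logic on a toy array of 6 rows and 2 features. The function is repeated so the example is self-contained; the toy values are arbitrary and not taken from the dataset.

```python
import numpy as np

def create_sequences_multifeature(data, sequence_length):
    # Same logic as the notebook function: windows of rows as inputs,
    # column 0 of the following row as the target.
    sequences, targets = [], []
    for i in range(len(data) - sequence_length):
        sequences.append(data[i:i + sequence_length])
        targets.append(data[i + sequence_length, 0])
    return np.array(sequences), np.array(targets)

# Toy data: rows are [0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11].
toy = np.arange(12).reshape(6, 2)

X_toy, y_toy = create_sequences_multifeature(toy, sequence_length=3)
print(X_toy.shape)  # (3, 3, 2): 3 windows, 3 time steps, 2 features
print(y_toy)        # [ 6  8 10]: column 0 of rows 3, 4 and 5
```

With `n` rows and a window of length `k`, the function yields `n - k` samples, so a short series leaves very few windows for training and testing.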

In [None]:
sequence_length = 8
X, y = create_sequences_multifeature(scaled_data, sequence_length)

In [None]:
train_size = int(len(X) * 0.7)
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

In [None]:
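# X is already shaped (samples, time steps, features), as the LSTM expects,
# so these reshapes leave the arrays unchanged.
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], X_train.shape[2]))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], X_test.shape[2]))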

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# Two stacked LSTM layers with dropout, followed by a small dense head
# that outputs one value: the scaled population for the next step.
model = Sequential()
model.add(LSTM(100, return_sequences=True, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dropout(0.2))
model.add(LSTM(100, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(50))
model.add(Dense(1))

# Train on mean squared error; RMSE is the square root of this loss.
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train, y_train, batch_size=1, epochs=10)

Epoch 1/10


  super().__init__(**kwargs)


[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 11ms/step - loss: 0.0350
Epoch 2/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - loss: 0.0223    
Epoch 3/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - loss: 0.0301
Epoch 4/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - loss: 0.0047    
Epoch 5/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - loss: 0.0016    
Epoch 6/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - loss: 0.0045
Epoch 7/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 14ms/step - loss: 0.0040
Epoch 8/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - loss: 0.0033
Epoch 9/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 12ms/step - loss: 0.0032
Epoch 10/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 12ms/step - loss: 0.0022    


<keras.src.callbacks.history.History at 0x7f8ed3f37730>

In [None]:
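# Take the last `sequence_length` rows of scaled_data and add a batch dimension.
last_sequence = scaled_data[-sequence_length:]
last_sequence = last_sequence.reshape((1, last_sequence.shape[0], last_sequence.shape[1]))

predicted_population_2023_scaled = model.predict(last_sequence)

# The scaler was fit on six columns, so pad the prediction with zeros
# and keep only column 0 (Population) after inverting the scaling.
predicted_population_2023 = scaler.inverse_transform([[predicted_population_2023_scaled[0][0], 0, 0, 0, 0, 0]])[0][0]

print(f"Predicted global population for 2023: {predicted_population_2023}")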

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 301ms/step
Predicted global population for 2023: 2438218441.5800743
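
The model is trained on scaled values, so an RMSE computed on its raw outputs is in scaled units rather than people. With its default range, `MinMaxScaler` maps each column to (x − min) / (max − min), so multiplying a scaled RMSE by the column's range converts it back to people. The sketch below reproduces that arithmetic in plain NumPy with hypothetical numbers; in the notebook, the corresponding values would be `scaler.data_min_[0]` and `scaler.data_range_[0]`, assuming Population is the first column.

```python
import numpy as np

# Hypothetical population column (in people); not the notebook's data.
population = np.array([6.0e9, 6.5e9, 7.0e9, 7.5e9, 8.0e9])
pop_min = population.min()
pop_range = population.max() - pop_min

def scale(x):
    """Min-max scale to [0, 1] using the column's min and range."""
    return (x - pop_min) / pop_range

def unscale(x_scaled):
    """Invert the min-max scaling for the population column."""
    return x_scaled * pop_range + pop_min

def rmse(a, b):
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

actual = np.array([7.0e9, 7.5e9, 8.0e9])
predicted_scaled = np.array([0.55, 0.70, 0.95])  # hypothetical model outputs

rmse_scaled = rmse(scale(actual), predicted_scaled)
rmse_people = rmse(actual, unscale(predicted_scaled))

print(f"RMSE (scaled units): {rmse_scaled:.3f}")
print(f"RMSE (people): {rmse_people:,.0f}")
```

Reporting the RMSE in people keeps the interpretability benefit described earlier, which a scaled loss value does not offer on its own.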


In [None]:
last_sequence

array([[[0.4915065 , 0.78861789, 1.        , 0.29411765, 0.36101083,
         0.5       ],
        [0.40459842, 0.79674797, 0.82319123, 0.19607843, 0.4368231 ,
         0.41666667],
        [0.32471311, 0.77235772, 0.61542323, 0.19607843, 0.5198556 ,
         0.33333333],
        [0.25308027, 0.91056911, 0.61290232, 0.09803922, 0.63898917,
         0.25      ],
        [0.18154756, 1.        , 0.5369808 , 0.        , 0.90974729,
         0.19444444],
        [0.11303043, 0.96747967, 0.34246599, 0.09803922, 1.        ,
         0.11111111],
        [0.05223935, 0.87804878, 0.12716376, 0.09803922, 0.86281588,
         0.05555556],
        [0.        , 0.8699187 , 0.        , 0.19607843, 0.97472924,
         0.        ]]])
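
The last row of `last_sequence` has a scaled population of 0, the minimum of the column, which suggests this window ends at the oldest year in the data rather than the most recent one. That would be consistent with the table being ordered newest first, and with a prediction of about 2.4 billion against a 2022 population of roughly 8 billion. The sketch below uses only the years visible in the table above to show how row order changes what each window predicts; `create_sequences` is a simplified single-column stand-in for the notebook's function. In the notebook itself, sorting with `df.sort_values('Year')` before scaling would be one way to get chronological order.

```python
import numpy as np

def create_sequences(values, sequence_length):
    # Single-column stand-in for create_sequences_multifeature.
    X, y = [], []
    for i in range(len(values) - sequence_length):
        X.append(values[i:i + sequence_length])
        y.append(values[i + sequence_length])
    return np.array(X), np.array(y)

# Years in table order (newest first) after dropping 2024 and 2023.
years_desc = np.array([2022, 2021, 2020, 2015, 2010, 2005, 2000, 1995])
years_asc = np.sort(years_desc)  # chronological order

X_desc, y_desc = create_sequences(years_desc, 3)
X_asc, y_asc = create_sequences(years_asc, 3)

# Newest-first: each window is followed by an OLDER year as its target.
print(X_desc[0], "->", y_desc[0])  # [2022 2021 2020] -> 2015
# Chronological: each window is followed by a LATER year as its target.
print(X_asc[0], "->", y_asc[0])    # [1995 2000 2005] -> 2010

# The window used for forecasting is the last rows of the array.
print(years_desc[-3:])  # [2005 2000 1995]
print(years_asc[-3:])   # [2020 2021 2022]
```

In chronological order, the final window covers the most recent years, which is what a forecast of the following year needs as input.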