### Predicting the Future World Population

In this project, we predict the future population of the world using data from the past (1960 to 2018). We do this with machine learning and, more specifically, regression.

The dataset for the total population was taken from the [World Bank](https://data.worldbank.org/indicator/SP.POP.TOTL) and is in [comma-separated values (csv)](https://en.wikipedia.org/wiki/Comma-separated_values) format. The project is coded in Python, so knowledge of the pandas and scikit-learn packages is required.

Overall, our goal is to create a csv file, ‘future_world_population.csv’, containing the predicted population for up to 10 years from now.


#### Resource

To begin, download the zip file below and extract the largest csv file, ‘API_SP.POP.TOTL_DS2_en_csv_v2_103676.csv’. After placing the file in your working directory, you may want to rename it to something like ‘world_population.csv’ for convenience. The code in this notebook uses the original file name, so adjust the path if you rename the file.

- World Bank Total Population (csv) (1960 – 2018) ([Download Link](https://data.worldbank.org/indicator/SP.POP.TOTL))





### Importing Packages

We import the packages needed for the analysis: at a minimum, pandas and scikit-learn. Matplotlib is used for the data visualizations, which help us raise questions about the data and decide whether it needs reformatting. NumPy is used to convert the data to NumPy arrays and to reshape them into different dimensions where needed.

In [1]:
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
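
As a rough illustration of how NumPy reshaping and `LinearRegression` fit together, here is a minimal sketch. The numbers are made-up toy values, not World Bank data, and the variable names (`years`, `population`, `future_years`) are only illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up toy values that grow by exactly 2 units per year (NOT World Bank data).
years = np.array([2000, 2001, 2002, 2003, 2004])
population = np.array([100.0, 102.0, 104.0, 106.0, 108.0])

# scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features),
# so the 1-D array of years is reshaped into a single column.
X = years.reshape(-1, 1)
print(X.shape)  # (5, 1)

# Fit a straight line through the toy points.
model = LinearRegression()
model.fit(X, population)

# Predict the three years that follow the observed range.
future_years = np.array([2005, 2006, 2007]).reshape(-1, 1)
predicted = model.predict(future_years)
print(predicted)
```

Because the toy values lie on a perfect line, the predictions simply continue that line. Real population data will not be this tidy.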

#### Loading the Data

When the csv file is opened in Microsoft Excel, you will notice that the first four rows are not part of the data.

![csv file](https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png "Logo Title Text 1")

Hence, when importing the csv file to pandas we can skip the first four rows.

In [6]:
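# Path to the World Bank csv file (original file name from the zip archive)
pop = 'API_SP.POP.TOTL_DS2_en_csv_v2_103676.csv'
# The first four rows are not part of the data, so skip them
df = pd.read_csv(pop, skiprows=4)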

df.head()

,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,Unnamed: 63
0,Aruba,ABW,"Population, total",SP.POP.TOTL,54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,...,101669.0,102046.0,102560.0,103159.0,103774.0,104341.0,104872.0,105366.0,105845.0,
1,Afghanistan,AFG,"Population, total",SP.POP.TOTL,8996973.0,9169410.0,9351441.0,9543205.0,9744781.0,9956320.0,...,29185507.0,30117413.0,31161376.0,32269589.0,33370794.0,34413603.0,35383128.0,36296400.0,37172386.0,
2,Angola,AGO,"Population, total",SP.POP.TOTL,5454933.0,5531472.0,5608539.0,5679458.0,5735044.0,5770570.0,...,23356246.0,24220661.0,25107931.0,26015780.0,26941779.0,27884381.0,28842484.0,29816748.0,30809762.0,
3,Albania,ALB,"Population, total",SP.POP.TOTL,1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,...,2913021.0,2905195.0,2900401.0,2895092.0,2889104.0,2880703.0,2876101.0,2873457.0,2866376.0,
4,Andorra,AND,"Population, total",SP.POP.TOTL,13411.0,14375.0,15370.0,16412.0,17469.0,18549.0,...,84449.0,83747.0,82427.0,80774.0,79213.0,78011.0,77297.0,77001.0,77006.0,
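
To see what `skiprows=4` does without needing the real file, the sketch below reads a tiny in-memory csv. The four metadata lines are made up, and the trailing comma on each row is an assumption about the real file's layout. The two data rows reuse values from the output above.

```python
import io
import pandas as pd

# Hypothetical miniature file: four made-up non-data lines, then header and rows.
# The trailing comma on each row is an assumption about the real file's layout.
raw = (
    "made-up metadata line 1\n"
    "made-up metadata line 2\n"
    "made-up metadata line 3\n"
    "made-up metadata line 4\n"
    "Country Name,Country Code,1960,1961,\n"
    "Aruba,ABW,54211,55438,\n"
    "Afghanistan,AFG,8996973,9169410,\n"
)

# Skip the first four lines so that the fifth line is used as the header.
toy_df = pd.read_csv(io.StringIO(raw), skiprows=4)

# The empty field after the trailing comma becomes an extra, unnamed column.
print(toy_df.columns.tolist())
```

A trailing comma like this would explain an extra empty column such as `Unnamed: 63` in the real data, although that is only a guess here.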


### Data Cleaning

We will clean the data in order to reduce the chance of errors. Here, we check which columns are useful and which are not.

In [7]:
print(df.columns)

Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', 'Unnamed: 63'],
      dtype='object')
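
One possible way to spot unhelpful columns programmatically, rather than by eye, is to look for columns that are entirely empty or that hold a single repeated value. The sketch below uses a small hypothetical frame (`toy_df`) built from a few values shown earlier. It is not the full dataset.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame mimicking the structure of the real data.
toy_df = pd.DataFrame({
    "Country Name": ["Aruba", "Afghanistan"],
    "Indicator Name": ["Population, total", "Population, total"],
    "1960": [54211.0, 8996973.0],
    "Unnamed: 63": [np.nan, np.nan],
})

# Columns in which every value is missing.
empty_columns = toy_df.columns[toy_df.isna().all()].tolist()

# Columns that contain exactly one distinct non-missing value.
constant_columns = [c for c in toy_df.columns if toy_df[c].nunique() == 1]

print(empty_columns)
print(constant_columns)
```

Note that `Country Code` would not be flagged by either check, since it varies per row. Dropping it is a judgment call because it duplicates the information in `Country Name`.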


We will remove the features that do not help with the analysis: `Country Code`, `Indicator Name`, `Indicator Code` and `Unnamed: 63`.

In [8]:
columns_to_drop = ['Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 63']
df.drop(columns=columns_to_drop, inplace=True)

df
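
Because `inplace=True` modifies `df` directly, running the drop cell a second time raises a `KeyError`, since the columns are already gone. A hedged alternative, sketched here on a one-row hypothetical frame (`toy_df`), is to assign the result of `drop` to a new variable and leave the original untouched.

```python
import numpy as np
import pandas as pd

# One-row hypothetical frame with the same kinds of columns as the real data.
toy_df = pd.DataFrame({
    "Country Name": ["Aruba"],
    "Country Code": ["ABW"],
    "Indicator Name": ["Population, total"],
    "Indicator Code": ["SP.POP.TOTL"],
    "1960": [54211.0],
    "Unnamed: 63": [np.nan],
})

columns_to_drop = ["Country Code", "Indicator Name", "Indicator Code", "Unnamed: 63"]

# drop() returns a new DataFrame by default, so toy_df itself is unchanged.
cleaned = toy_df.drop(columns=columns_to_drop)

# errors="ignore" skips labels that are missing, so repeating the drop is harmless.
cleaned_again = cleaned.drop(columns=columns_to_drop, errors="ignore")

print(cleaned.columns.tolist())
```

Either style works. The non-inplace form just makes the cell safe to re-run.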