## Unemployment Analysis

### Import the libraries and load the dataset 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load the dataset and show the first 5 rows
unemployment_data = pd.read_csv('unemployment.csv')

unemployment_data.head()

Unnamed: 0,Region,Date,Frequency,Estimated Unemployment Rate (%),Estimated Employed,Estimated Labour Participation Rate (%),Area
0,Andhra Pradesh,31-05-2019,Monthly,3.65,11999139.0,43.24,Rural
1,Andhra Pradesh,30-06-2019,Monthly,3.05,11755881.0,42.05,Rural
2,Andhra Pradesh,31-07-2019,Monthly,3.75,12086707.0,43.5,Rural
3,Andhra Pradesh,31-08-2019,Monthly,3.32,12285693.0,43.97,Rural
4,Andhra Pradesh,30-09-2019,Monthly,5.17,12256762.0,44.68,Rural


### Exploratory Data Analysis

In [3]:
# Getting the shape of our dataset, total rows and total columns
unemployment_data.shape

(768, 7)

In [4]:
# Let's check the columns and their data types
unemployment_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 7 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Region                                    740 non-null    object 
 1    Date                                     740 non-null    object 
 2    Frequency                                740 non-null    object 
 3    Estimated Unemployment Rate (%)          740 non-null    float64
 4    Estimated Employed                       740 non-null    float64
 5    Estimated Labour Participation Rate (%)  740 non-null    float64
 6   Area                                      740 non-null    object 
dtypes: float64(3), object(4)
memory usage: 42.1+ KB


In [5]:
# Statistical analysis
unemployment_data.describe()

Unnamed: 0,Estimated Unemployment Rate (%),Estimated Employed,Estimated Labour Participation Rate (%)
count,740.0,740.0,740.0
mean,11.787946,7204460.0,42.630122
std,10.721298,8087988.0,8.111094
min,0.0,49420.0,13.33
25%,4.6575,1190404.0,38.0625
50%,8.35,4744178.0,41.16
75%,15.8875,11275490.0,45.505
max,76.74,45777510.0,72.57


In [None]:
# Checking for missing values
unemployment_data.isnull().sum()

#### Check for inconsistencies and fix them

In [12]:
# Turn column names into lower case
unemployment_data.columns = unemployment_data.columns.str.lower()

unemployment_data.head()

Unnamed: 0,region,date,frequency,estimated unemployment rate (%),estimated employed,estimated labour participation rate (%),area
0,Andhra Pradesh,31-05-2019,Monthly,3.65,11999139.0,43.24,Rural
1,Andhra Pradesh,30-06-2019,Monthly,3.05,11755881.0,42.05,Rural
2,Andhra Pradesh,31-07-2019,Monthly,3.75,12086707.0,43.5,Rural
3,Andhra Pradesh,31-08-2019,Monthly,3.32,12285693.0,43.97,Rural
4,Andhra Pradesh,30-09-2019,Monthly,5.17,12256762.0,44.68,Rural


In [13]:
# Some column names start with white space, let's remove it.

unemployment_data.columns = unemployment_data.columns.str.strip()

unemployment_data.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 7 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   region                                   740 non-null    object 
 1   date                                     740 non-null    object 
 2   frequency                                740 non-null    object 
 3   estimated unemployment rate (%)          740 non-null    float64
 4   estimated employed                       740 non-null    float64
 5   estimated labour participation rate (%)  740 non-null    float64
 6   area                                     740 non-null    object 
dtypes: float64(3), object(4)
memory usage: 42.1+ KB
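
Padding like this is easy to miss in a rendered table, so it can help to list the affected names explicitly before cleaning them. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from `unemployment.csv`). It also assumes, without confirmation from the outputs above, that cell values could carry the same padding as the headers.

```python
import pandas as pd

# Hypothetical toy frame with padded column names, similar to the info() output above
toy = pd.DataFrame({
    "Region": ["Goa"],
    " Date": [" 31-05-2019"],
    " Frequency": [" Monthly"],
})

# List the column names that change when stripped, i.e. the padded ones
padded = [c for c in toy.columns if c != c.strip()]
print(padded)

# Normalise the headers: strip white space, then lower-case
toy.columns = toy.columns.str.strip().str.lower()
print(list(toy.columns))

# Assumption: the values may be padded too, so strip the string columns as well
toy["date"] = toy["date"].str.strip()
toy["frequency"] = toy["frequency"].str.strip()
```

Chaining `str.strip()` and `str.lower()` does both header fixes in one step. The two steps are independent, so their order does not matter.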


In [17]:
# The date column is stored as object (string), not as a date,
# so we convert it to datetime. The dates are day-first (e.g. 31-05-2019).

unemployment_data['date'] = pd.to_datetime(unemployment_data['date'], dayfirst=True)

unemployment_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 7 columns):
 #   Column                                   Non-Null Count  Dtype         
---  ------                                   --------------  -----         
 0   region                                   740 non-null    object        
 1   date                                     740 non-null    datetime64[ns]
 2   frequency                                740 non-null    object        
 3   estimated unemployment rate (%)          740 non-null    float64       
 4   estimated employed                       740 non-null    float64       
 5   estimated labour participation rate (%)  740 non-null    float64       
 6   area                                     740 non-null    object        
dtypes: datetime64[ns](1), float64(3), object(3)
memory usage: 42.1+ KB
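
The dates in this dataset are written day-first (`31-05-2019`), so the order of day and month matters when parsing. The sketch below uses a hypothetical two-value sample rather than the real column and assumes the strings are already free of surrounding white space. It shows how an explicit `format` removes the ambiguity for a value such as `05-06-2019`.

```python
import pandas as pd

# Hypothetical sample: the second value is ambiguous (5 June or 6 May?)
raw = pd.Series(["31-05-2019", "05-06-2019"])

# An explicit day-month-year format leaves nothing to inference
parsed = pd.to_datetime(raw, format="%d-%m-%Y")
print(parsed.dt.day.tolist(), parsed.dt.month.tolist())

# The same string read month-first gives a different date
month_first = pd.to_datetime("05-06-2019", format="%m-%d-%Y")
print(month_first.month)
```

An explicit format also fails loudly on a value that does not match, which is usually preferable to a silently misread date.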


In [19]:
# Check for unique values in the region and area columns

unique_region = unemployment_data['region'].unique()
unique_area = unemployment_data['area'].unique()

print(f'The unique regions: {unique_region}\n The unique area: {unique_area}')

The unique regions: ['Andhra Pradesh' 'Assam' 'Bihar' 'Chhattisgarh' 'Delhi' 'Goa' 'Gujarat'
 'Haryana' 'Himachal Pradesh' 'Jammu & Kashmir' 'Jharkhand' 'Karnataka'
 'Kerala' 'Madhya Pradesh' 'Maharashtra' 'Meghalaya' 'Odisha' 'Puducherry'
 'Punjab' 'Rajasthan' 'Sikkim' 'Tamil Nadu' 'Telangana' 'Tripura'
 'Uttar Pradesh' 'Uttarakhand' 'West Bengal' nan 'Chandigarh']
 The unique area: ['Rural' nan 'Urban']


It looks like there are no inconsistent entries in the region and area columns,
but both contain `nan`, so there are missing values. Let's check for them next.
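
Reading the `unique()` output by eye works for a short list like this one. For longer lists, a programmatic check can flag labels that differ only in case or white space. This is a minimal sketch on hypothetical labels (`labels` is made up for illustration), not the check used in this notebook.

```python
import pandas as pd

# Hypothetical labels: the same state entered three ways, plus a missing value
labels = pd.Series(["Goa", "goa ", "GOA", "Delhi", None])

# Canonical form of each label: trimmed and lower-cased
canonical = labels.str.strip().str.lower()

# Count how many distinct raw spellings map to each canonical label
variants = labels.groupby(canonical).nunique()

# More than one spelling for the same canonical label signals an inconsistency
suspects = variants[variants > 1]
print(suspects)
```

An empty `suspects` result would support the conclusion that the entries are consistent. Missing values are ignored by the grouping and have to be checked separately.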

In [22]:
# Checking for missing values
unemployment_data.isnull().sum()


region                                     28
date                                       28
frequency                                  28
estimated unemployment rate (%)            28
estimated employed                         28
estimated labour participation rate (%)    28
area                                       28
dtype: int64

Every column has exactly the same number of missing values (28). As there is no additional information to explain this,
it could be due to a data collection or data entry issue that left some rows with missing values in all columns.
Let's calculate the percentage of missing values and drop those rows if it is relatively low.
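
Identical per-column counts suggest fully empty rows but do not prove it, because the gaps could in principle sit in different rows. The sketch below shows one way to test this directly, using a hypothetical toy frame (`toy` is made up for illustration) in which one row is entirely empty and another is only partly empty.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 is entirely empty, row 2 is only partly empty
toy = pd.DataFrame({
    "region": ["Goa", np.nan, "Delhi"],
    "rate": [3.5, np.nan, np.nan],
})

# Boolean mask of rows where every column is missing
fully_empty = toy.isnull().all(axis=1)
print(fully_empty.sum())

# how="all" drops only the fully empty rows and keeps the partly filled one.
# dropna returns a new DataFrame, so the result has to be assigned.
cleaned = toy.dropna(how="all")
print(len(toy), len(cleaned))
```

Applied to this notebook's data, `unemployment_data.isnull().all(axis=1).sum()` returning 28 would confirm that the missing values come from 28 completely empty rows.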

In [23]:
# Checking the percentage of missing values in each column
missing_percent = unemployment_data.isnull().mean() * 100

print(missing_percent)

region                                     3.645833
date                                       3.645833
frequency                                  3.645833
estimated unemployment rate (%)            3.645833
estimated employed                         3.645833
estimated labour participation rate (%)    3.645833
area                                       3.645833
dtype: float64


In [25]:
# The percentage of missing values is low (about 3.6%), so we drop those rows.
# dropna() returns a new DataFrame, so we assign the result back.
unemployment_data = unemployment_data.dropna()

unemployment_data


Unnamed: 0,region,date,frequency,estimated unemployment rate (%),estimated employed,estimated labour participation rate (%),area
0,Andhra Pradesh,2019-05-31,Monthly,3.65,11999139.0,43.24,Rural
1,Andhra Pradesh,2019-06-30,Monthly,3.05,11755881.0,42.05,Rural
2,Andhra Pradesh,2019-07-31,Monthly,3.75,12086707.0,43.50,Rural
3,Andhra Pradesh,2019-08-31,Monthly,3.32,12285693.0,43.97,Rural
4,Andhra Pradesh,2019-09-30,Monthly,5.17,12256762.0,44.68,Rural
...,...,...,...,...,...,...,...
749,West Bengal,2020-02-29,Monthly,7.55,10871168.0,44.09,Urban
750,West Bengal,2020-03-31,Monthly,6.67,10806105.0,43.34,Urban
751,West Bengal,2020-04-30,Monthly,15.63,9299466.0,41.20,Urban
752,West Bengal,2020-05-31,Monthly,15.22,9240903.0,40.67,Urban
