# **Pandas Day 5**

# **Population of Pakistan Dataset**

Author Name : Muhammad Ishfaq Khan\
Email : ishfaqkhanniaxii@gmail.com

Data Download Link : [Population of Pakistan](https://www.kaggle.com/datasets/mabdullahsajid/population-of-pakistan-dataset)

### **About Dataset**

`Description`\
This dataset contains demographic information from the Pakistan Population Census conducted in 2017. It provides detailed population data at various administrative levels within Pakistan, including provinces, divisions, districts, and sub-divisions. The dataset also includes information on urban and rural populations, gender distribution, transgender individuals, sex ratios, population figures from the 1998 census, and annual growth rates.

`Features`\
Province: The administrative provinces or regions of Pakistan where the census data was collected.

Division: The divisions within each province. Divisions are the second level of administrative divisions in Pakistan.

District: Districts within each division, representing larger administrative units.

Sub-Division: Sub-divisions or tehsils within each district, providing more localized data.

Area: The land area (in square kilometers) of each sub-division.

Urban Population 2017: The population of urban areas within each sub-division for the year 2017.

Rural Population 2017: The population of rural areas within each sub-division for the year 2017.

Male Population 2017: The male population within each sub-division for the year 2017.

Female Population 2017: The female population within each sub-division for the year 2017.

Transgender Population 2017: The population of transgender individuals within each sub-division for the year 2017.

Sex Ratio 2017: The sex ratio, calculated as the number of females per 1000 males, within each sub-division for the year 2017.

Population in 1998: The total population of each sub-division as recorded in the 1998 census.

Annual Growth Rate: The annual growth rate of the population in each sub-division, calculated as the percentage increase from 1998 to 2017.

`Data Source`\
The data in this dataset was collected from official Pakistan Population Census reports and may include data from various government sources. It is essential to provide proper attribution and reference the original sources when using this dataset for analysis or research.

`Data Usage`\
Researchers and analysts can use this dataset to explore demographic trends, population growth, urbanization rates, gender distribution, and more within Pakistan at different administrative levels. Ensure compliance with ethical and legal guidelines when using this data for research or public sharing.

In [None]:
# import libraries

import pandas as pd  
import seaborn as sns  
import matplotlib.pyplot as plt  
import numpy as np  

In [None]:
# load the dataset from a local CSV file

df = pd.read_csv('../python_datasets/population_of_pakistan.csv')
print(df.head())

In [None]:
# Explore the data (Composition)

# df.info()
df.head()

# pd.set_option('display.max_columns', None)  # Show all columns in the DataFrame
# print(df.head())

In [None]:
df.columns  # names of all columns in the DataFrame
df.dtypes  # data type of each column in the DataFrame
df.describe()  # summary statistics of the numeric columns

# take the transpose of the summary so that each column becomes a row

df.describe().T

In [None]:
# check for missing values in the DataFrame

df.isnull().sum() 
# there are no missing values in this DataFrame, so it is ready for analysis
# if there were any missing values, we would handle them first and then perform the analysis
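
The population dataset has no missing values, so the handling step mentioned above is never exercised. As a minimal sketch of what it could look like, the example below uses a small made-up DataFrame (`toy` and its values are hypothetical, not taken from the dataset) and shows two common options: dropping incomplete rows or filling gaps with the column median. Which option is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    'PROVINCE': ['PUNJAB', 'SINDH', 'KPK', 'BALOCHISTAN'],
    'AREA (sq.km)': [1200.0, np.nan, 950.0, np.nan],
})

# Count missing values per column, as in the cell above.
missing_counts = toy.isnull().sum()
print(missing_counts)

# Option 1: drop every row that contains a missing value.
dropped = toy.dropna()
print(dropped)

# Option 2: fill missing values in a numeric column with that column's median.
area_median = toy['AREA (sq.km)'].median()
filled = toy.fillna({'AREA (sq.km)': area_median})
print(filled)
```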

In [None]:
print(df['AREA (sq.km)'].dtype)

In [None]:
# make a boxplot for visualization

# never run two plots in the same cell; it causes issues

# sns.boxplot(data=df, y='AREA (sq.km)') 
sns.boxplot(data=df, x='PROVINCE', y='AREA (sq.km)')


In [None]:
# make a histplot for visualization

# never run two plots in the same cell; it causes issues

# sns.histplot(data=df, x='AREA (sq.km)')
sns.histplot(data=df, x='PROVINCE', y='FEMALE (RURAL)')

In [71]:
df.columns  # names of all the columns in the DataFrame

Index(['PROVINCE', 'DIVISION', 'DISTRICT', 'SUB DIVISION', 'AREA (sq.km)',
       'ALL SEXES (RURAL)', 'MALE (RURAL)', 'FEMALE (RURAL)',
       'TRANSGENDER (RURAL)', 'SEX RATIO (RURAL)',
       'AVG HOUSEHOLD SIZE (RURAL)', 'POPULATION 1998 (RURAL)',
       'ANNUAL GROWTH RATE (RURAL)', 'ALL SEXES (URBAN)', 'MALE (URBAN)',
       'FEMALE (URBAN)', 'TRANSGENDER (URBAN)', 'SEX RATIO (URBAN)',
       'AVG HOUSEHOLD SIZE (URBAN)', 'POPULATION 1998 (URBAN)',
       'ANNUAL GROWTH RATE (URBAN)'],
      dtype='object')

In [None]:
# if the plot is hard to read, use the groupby function to summarize the data first

df['PROVINCE'].unique()  # unique values in the column
df['PROVINCE'].nunique()  # number of unique values in the column
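
`unique()` and `nunique()` say which values occur, but not how often. A short sketch with `value_counts()` on a made-up `PROVINCE` column (the rows are hypothetical) shows how many rows each value has, which is often a useful check before grouping:

```python
import pandas as pd

# Hypothetical PROVINCE column; the real dataset has many more rows.
provinces = pd.Series(
    ['PUNJAB', 'SINDH', 'PUNJAB', 'KPK', 'PUNJAB', 'SINDH'],
    name='PROVINCE',
)

# Number of rows per distinct value, sorted from most to least frequent.
counts = provinces.value_counts()
print(counts)
```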

In [None]:
# use the groupby function to work with the data in groups

df.groupby('DIVISION').size()
df.groupby(['PROVINCE', 'DIVISION']).size()
df.groupby('DIVISION')['FEMALE (RURAL)'].mean()
df.groupby(['PROVINCE', 'DIVISION'])['FEMALE (RURAL)'].mean()
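
A notebook cell displays only the value of its last expression, so of the four `groupby` lines above only the final one is shown. One way around this, sketched here on a small made-up DataFrame (`toy` and its numbers are hypothetical), is to compute several statistics in a single `agg` call and print the result:

```python
import pandas as pd

# Hypothetical toy data using the same column names as the dataset.
toy = pd.DataFrame({
    'PROVINCE': ['PUNJAB', 'PUNJAB', 'SINDH', 'SINDH'],
    'DIVISION': ['LAHORE', 'LAHORE', 'KARACHI', 'HYDERABAD'],
    'FEMALE (RURAL)': [100, 300, 50, 150],
})

# Row count, total, and mean of FEMALE (RURAL) for each province/division pair.
summary = (
    toy.groupby(['PROVINCE', 'DIVISION'])['FEMALE (RURAL)']
       .agg(['count', 'sum', 'mean'])
       .reset_index()
)
print(summary)
```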

In [73]:
# compare the total rural and urban female populations

f_rural = df['FEMALE (RURAL)'].sum()
f_urban = df['FEMALE (URBAN)'].sum()
print('Female Rural Population : ', f_rural)
print('Female Urban Population : ', f_urban)
print('Difference between rural and urban female population : ', f_rural - f_urban)

Female Rural Population :  63879631
Female Urban Population :  35902873
Difference between rural and urban female population :  27976758


### **Assignment No 1 : Percentage Calculation**

In [76]:
pop_urban_2017 = df['ALL SEXES (URBAN)'].sum()
pop_urban_1998 = df['POPULATION 1998 (URBAN)'].sum()

print('Population Urban 2017 : ', pop_urban_2017)
print('Population Urban 1998 : ', pop_urban_1998)

print('The total population increase in urban areas from 1998 to 2017 : ', pop_urban_2017 - pop_urban_1998)
print('The total percentage increase in urban population from 1998 to 2017 : ', (pop_urban_2017 - pop_urban_1998)/pop_urban_1998 * 100)

Population Urban 2017 :  74375943
Population Urban 1998 :  42316331
The total population increase in urban areas from 1998 to 2017 :  32059612
The total percentage increase in urban population from 1998 to 2017 :  75.76179513294761
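
The figure above is the total increase over the whole period between the two censuses. A rough per-year figure can be derived from the same two totals. The sketch below assumes compound (geometric) growth over the 19 years, which is an assumption made here for illustration and not necessarily how the dataset's `ANNUAL GROWTH RATE` columns were computed:

```python
# Urban totals copied from the printed output of the cell above.
pop_urban_2017 = 74375943
pop_urban_1998 = 42316331
years = 2017 - 1998  # 19 years between the two censuses

# Total percentage increase over the whole period.
total_increase_pct = (pop_urban_2017 - pop_urban_1998) / pop_urban_1998 * 100

# Average annual growth rate, assuming compound growth (illustrative assumption).
annual_growth_pct = ((pop_urban_2017 / pop_urban_1998) ** (1 / years) - 1) * 100

print(f'Total increase 1998-2017 : {total_increase_pct:.2f}%')
print(f'Average annual growth    : {annual_growth_pct:.2f}%')
```

Under this assumption the annual rate comes out at roughly 3% per year, far below the total divided by 19, because growth compounds.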


### **Assignment No 2: How to combine 3 columns into one column**

In [77]:
import pandas as pd

# Sample data
df = pd.DataFrame({
    'A': ['John', 'Jane', 'Doe'],
    'B': ['Smith', 'Doe', 'Brown'],
    'C': ['NY', 'LA', 'SF']
})

# Combine columns A, B, C into one column with space separator
df['Combined'] = df['A'] + ' ' + df['B'] + ' ' + df['C']

print(df)


      A      B   C       Combined
0  John  Smith  NY  John Smith NY
1  Jane    Doe  LA    Jane Doe LA
2   Doe  Brown  SF   Doe Brown SF
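
Joining with `+` works well when every value is present, but a single missing value makes the whole combined string missing. The sketch below uses a made-up DataFrame (`people` is a hypothetical name) with one gap and shows one possible alternative that skips missing parts row by row:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value in column B.
people = pd.DataFrame({
    'A': ['John', 'Jane', 'Doe'],
    'B': ['Smith', np.nan, 'Brown'],
    'C': ['NY', 'LA', 'SF'],
})

# With '+', a missing value in any column makes the whole result missing.
plus_result = people['A'] + ' ' + people['B'] + ' ' + people['C']
print(plus_result)

# Alternative: drop the missing parts of each row, then join what is left.
people['Combined'] = people[['A', 'B', 'C']].apply(
    lambda row: ' '.join(row.dropna()), axis=1
)
print(people)
```

The row-wise `apply` is slower than vectorized concatenation on large tables, so it is best kept for cases where missing values actually occur.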


In [78]:
# Combine columns using aggregate function

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

# Sum across columns A, B, C for each row
df['Sum'] = df[['A', 'B', 'C']].agg('sum', axis=1)

print(df)


   A  B  C  Sum
0  1  4  7   12
1  2  5  8   15
2  3  6  9   18
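
The same row-wise sum can be applied to the population columns. As a sketch, the example below builds a small made-up DataFrame with the dataset's `ALL SEXES (RURAL)` and `ALL SEXES (URBAN)` column names (the rows and the two new column names are hypothetical) and derives a total and an urban share from them:

```python
import pandas as pd

# Hypothetical toy rows; only the two ALL SEXES column names come from the dataset.
toy = pd.DataFrame({
    'DISTRICT': ['X', 'Y'],
    'ALL SEXES (RURAL)': [300, 100],
    'ALL SEXES (URBAN)': [100, 100],
})

# Sum the rural and urban columns row by row to get a total per district.
cols = ['ALL SEXES (RURAL)', 'ALL SEXES (URBAN)']
toy['TOTAL POPULATION'] = toy[cols].sum(axis=1)

# Express the urban population as a percentage of the total.
toy['URBAN SHARE (%)'] = toy['ALL SEXES (URBAN)'] / toy['TOTAL POPULATION'] * 100

print(toy)
```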


In [79]:
# Combine two DataFrames using merge

df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['John', 'Jane', 'Doe']
})

df2 = pd.DataFrame({
    'ID': [1, 2, 4],
    'City': ['NY', 'LA', 'SF']
})

# Merge on 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='inner')

print(merged_df)


   ID  Name City
0   1  John   NY
1   2  Jane   LA
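
The inner merge keeps only the IDs found in both DataFrames, so ID 3 and ID 4 disappear from the result. To see what was left out, one option is an outer merge with `indicator=True`, which adds a `_merge` column recording where each row came from. A minimal sketch using the same two sample DataFrames:

```python
import pandas as pd

df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['John', 'Jane', 'Doe']
})

df2 = pd.DataFrame({
    'ID': [1, 2, 4],
    'City': ['NY', 'LA', 'SF']
})

# Outer merge keeps every ID from both sides; unmatched cells become NaN.
# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged_outer = pd.merge(df1, df2, on='ID', how='outer', indicator=True)

print(merged_outer)
```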


In [80]:
# combine two DataFrames using the concat function

import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({
    'A': [1, 2],
    'B': ['x', 'y']
})

df2 = pd.DataFrame({
    'A': [3, 4],
    'B': ['z', 'w']
})

# Concatenate vertically (rows-wise)
df_concat = pd.concat([df1, df2])

print(df_concat)


   A  B
0  1  x
1  2  y
0  3  z
1  4  w


In [81]:
# Notice that the index repeats. If you prefer, you can use ignore_index=True:

df_concat = pd.concat([df1, df2], ignore_index=True)
print(df_concat)


   A  B
0  1  x
1  2  y
2  3  z
3  4  w


In [82]:
# Horizontal concatenation (column-wise):

df_concat_cols = pd.concat([df1, df2], axis=1)
print(df_concat_cols)


   A  B  A  B
0  1  x  3  z
1  2  y  4  w
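
The column-wise result has two columns named `A` and two named `B`, so `df_concat_cols['A']` returns both of them. One possible way to keep the sources apart is the `keys` argument of `pd.concat`, which adds an outer column level. The labels `'first'` and `'second'` below are arbitrary names chosen for this sketch:

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 2],
    'B': ['x', 'y']
})

df2 = pd.DataFrame({
    'A': [3, 4],
    'B': ['z', 'w']
})

# keys= labels each input, giving two-level column names such as ('second', 'A').
wide = pd.concat([df1, df2], axis=1, keys=['first', 'second'])
print(wide)

# Select column A of the second DataFrame without ambiguity.
second_a = wide[('second', 'A')]
print(second_a)
```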
