# **Pandas Intermediate**

### **Import pandas**

In [1]:
import pandas as pd

### **Importing and Exporting Data**

Pandas supports reading from and writing to a variety of file formats, 
including CSV, Excel, and SQL databases, which makes it easy to integrate into data analysis workflows.

In [2]:
# Read a CSV file into a DataFrame
csv_df = pd.read_csv("example.csv")
csv_df

Unnamed: 0,A,B,C
0,1.0,5.0,10.0
1,2.0,6.5,11.0
2,2.333333,6.5,12.0
3,4.0,8.0,11.0


### **Install openpyxl**

In [3]:
# Install openpyxl, the engine pandas uses to write .xlsx files
%pip install openpyxl

Note: you may need to restart the kernel to use updated packages.





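In [7]:
# Export the DataFrame to an Excel spreadsheet (requires openpyxl)
csv_df.to_excel("exported_csv_df.xlsx", sheet_name="csv_df", index=False)

In [9]:
# Export the DataFrame to a CSV file, omitting the index column
csv_df.to_csv("exported_csv_df.csv", index=False)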

### **Importing SQL Database**

In [10]:
import sqlite3
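
A minimal sketch of a full SQL round trip, assuming an in-memory SQLite database instead of a file on disk. The two rows are made up for illustration; only the table and column names echo the census data used in this notebook.

```python
import sqlite3

import pandas as pd

# An in-memory database keeps the sketch self-contained (no file is created).
conn = sqlite3.connect(":memory:")

# Write a small, made-up table so there is something to query.
pd.DataFrame({"individual_id": [1, 2], "age": [25, 38]}).to_sql(
    "individuals", conn, index=False
)

# Read it back; the "?" placeholder is filled from params by sqlite3.
loaded_df = pd.read_sql_query(
    "SELECT * FROM individuals WHERE age > ?", conn, params=(30,)
)

# Close the connection once the data is in a DataFrame.
conn.close()
```

Passing values through `params` rather than formatting them into the query string avoids quoting problems.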

## **Data Inspection** 

Data inspection is the initial review of a dataset: checking for missing values and 
incorrect data types and gathering basic statistics. It gives a first picture of the data's quality and structure.

In [14]:
conn = sqlite3.connect("census_data.db")
census_df = pd.read_sql_query("SELECT * FROM individuals", conn)
census_df

Unnamed: 0,individual_id,age,workclass,fnlwgt,education,educational_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income
0,1,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,2,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,3,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,4,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,5,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,48838,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
48838,48839,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
48839,48840,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
48840,48841,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [15]:
census_df.isnull()

Unnamed: 0,individual_id,age,workclass,fnlwgt,education,educational_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
48838,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
48839,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
48840,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [16]:
census_df.isnull().sum()

individual_id      0
age                0
workclass          0
fnlwgt             0
education          0
educational_num    0
marital_status     0
occupation         0
relationship       0
race               0
gender             0
capital_gain       0
capital_loss       0
hours_per_week     0
native_country     0
income             0
dtype: int64

In [18]:
(census_df == "?").sum()

individual_id         0
age                   0
workclass          2799
fnlwgt                0
education             0
educational_num       0
marital_status        0
occupation         2809
relationship          0
race                  0
gender                0
capital_gain          0
capital_loss          0
hours_per_week        0
native_country      857
income                0
dtype: int64

In [19]:
# Check data type of each column inside dataframe
census_df.dtypes

individual_id       int64
age                 int64
workclass          object
fnlwgt             object
education          object
educational_num     int64
marital_status     object
occupation         object
relationship       object
race               object
gender             object
capital_gain        int64
capital_loss        int64
hours_per_week      int64
native_country     object
income             object
dtype: object

In [20]:
census_df.describe()

Unnamed: 0,individual_id,age,educational_num,capital_gain,capital_loss,hours_per_week
count,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0
mean,24421.5,38.643585,10.078089,1079.067626,87.502314,40.422382
std,14099.615261,13.71051,2.570973,7452.019058,403.004552,12.391444
min,1.0,17.0,1.0,0.0,0.0,1.0
25%,12211.25,28.0,9.0,0.0,0.0,40.0
50%,24421.5,37.0,10.0,0.0,0.0,40.0
75%,36631.75,48.0,12.0,0.0,0.0,45.0
max,48842.0,90.0,16.0,99999.0,4356.0,99.0


In [21]:
# Replace the "?" placeholders with pandas' missing-value marker (pd.NA)
census_df.replace("?", pd.NA, inplace=True)
census_df

Unnamed: 0,individual_id,age,workclass,fnlwgt,education,educational_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income
0,1,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,2,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,3,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,4,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,5,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,48838,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
48838,48839,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
48839,48840,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
48840,48841,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


## **Cleaning Data**

Cleaning data means removing or correcting inaccuracies, inconsistencies, 
and missing values in a dataset. Common techniques include handling 
missing values through deletion or imputation, correcting data types, and detecting 
and removing duplicate entries. The result is a more precise and dependable analysis.
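
The techniques above can be sketched on a small, made-up frame. The rows are hypothetical. The assumption that `fnlwgt` holds digit strings is only a guess based on its `object` dtype in the inspection output, and filling with the most frequent value is just one possible imputation strategy.

```python
import pandas as pd

# Hypothetical rows that mimic two census columns.
sample_df = pd.DataFrame({
    "workclass": ["Private", "?", "Local-gov", "Private"],
    "fnlwgt": ["226802", "89814", "336951", "160323"],  # numbers stored as text
})

# Mark the "?" placeholders as missing so pandas can count them.
sample_df = sample_df.replace("?", pd.NA)
missing_counts = sample_df.isna().sum()

# Correct the data type: convert the text column to integers.
sample_df["fnlwgt"] = pd.to_numeric(sample_df["fnlwgt"])

# Imputation: fill missing workclass values with the most frequent value.
most_common = sample_df["workclass"].mode()[0]
sample_df["workclass"] = sample_df["workclass"].fillna(most_common)
```

Deletion is the alternative to imputation: `sample_df.dropna(subset=["workclass"])` would discard the incomplete row instead of filling it.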

### **Trimming and Cleaning Text Data**

In [None]:
occupation_mapping = {
    "Machine-op-inspct": "Machine Operator",
    "Farming-fishing": "Farming and Fishing",
    "Protective-serv": "Protective Services",
    "Other-service": "Other Service",
    "Prof-specialty": "Professional Specialty",
    "Craft-repair": "Craft Repair",
    "Adm-clerical": "Admin Clerical",
    "Exec-managerial": "Executive and Managerial",
    "Tech-support": "Tech Support",
    "Priv-house-serv": "Private Household Services",
    "Transport-moving": "Transportation and Moving",
    "Handlers-cleaners": "Handlers and Cleaners",
    "Armed-Forces": "Armed Forces"
}
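
One way such a mapping could be applied is sketched below. The sample rows, including the stray whitespace, are made up, and only two entries of `occupation_mapping` are repeated so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical rows; the padding around the labels is added for illustration.
sample_df = pd.DataFrame({
    "occupation": [" Machine-op-inspct", "Tech-support ", "Sales"],
})

# A subset of the occupation_mapping defined above.
occupation_mapping = {
    "Machine-op-inspct": "Machine Operator",
    "Tech-support": "Tech Support",
}

# Trim surrounding whitespace first, then swap coded labels for readable ones.
# Values that are not keys in the mapping (here "Sales") are left unchanged.
sample_df["occupation"] = (
    sample_df["occupation"].str.strip().replace(occupation_mapping)
)
```

`replace` is used here rather than `map` because `map` would turn every unmapped value into a missing value.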

### **Renaming Columns and Reindexing**
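
A minimal sketch of both operations on made-up rows. The new column names (`final_weight`, `education_num`) are illustrative choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical rows using column names from the census table.
sample_df = pd.DataFrame({
    "individual_id": [1, 2, 3],
    "fnlwgt": [226802, 89814, 336951],
    "educational_num": [7, 9, 12],
})

# rename() returns a new DataFrame; columns not listed keep their names.
renamed_df = sample_df.rename(
    columns={"fnlwgt": "final_weight", "educational_num": "education_num"}
)

# Use individual_id as the row index instead of the default 0..n-1 range.
indexed_df = renamed_df.set_index("individual_id")

# reindex() reorders rows by label; labels that do not exist (here 4)
# produce a row of missing values.
reindexed_df = indexed_df.reindex([3, 1, 4])
```

`reset_index()` reverses `set_index()` and moves the index back into an ordinary column.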

### **Filtering and Selecting Data**

Filtering and selecting data are fundamental for focusing analysis on specific segments.

**Example**

1. Select individuals working more than 40 hours per week but earning '<=50K'.
2. Find divorced individuals in the Private sector.
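
The two examples can be sketched with boolean masks and `query`. The rows are made up, and the label `"Divorced"` is assumed to be one of the `marital_status` values in the census table.

```python
import pandas as pd

# Hypothetical rows using column names from the census table.
sample_df = pd.DataFrame({
    "workclass": ["Private", "Private", "Local-gov", "Private"],
    "marital_status": ["Divorced", "Never-married", "Divorced", "Married-civ-spouse"],
    "hours_per_week": [45, 40, 50, 60],
    "income": ["<=50K", "<=50K", "<=50K", ">50K"],
})

# 1. More than 40 hours per week but earning '<=50K'.
#    Each condition needs its own parentheses when combined with &.
long_hours_low_income = sample_df[
    (sample_df["hours_per_week"] > 40) & (sample_df["income"] == "<=50K")
]

# 2. Divorced individuals in the Private sector, written with query().
divorced_private = sample_df.query(
    "marital_status == 'Divorced' and workclass == 'Private'"
)
```

Both forms return a new DataFrame that keeps the original row labels.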

### **Removing Columns and Rows**
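
A minimal sketch of dropping a column, dropping a row by label, and dropping rows with missing values. The rows are made up, and removing `fnlwgt` is only an example choice.

```python
import pandas as pd

# Hypothetical rows; one workclass value is missing.
sample_df = pd.DataFrame({
    "individual_id": [1, 2, 3, 4],
    "age": [25, 38, 18, 44],
    "workclass": ["Private", "Private", pd.NA, "Private"],
    "fnlwgt": [226802, 89814, 103497, 160323],
})

# Remove a column by name.
without_weight = sample_df.drop(columns=["fnlwgt"])

# Remove a row by its index label.
without_first = without_weight.drop(index=[0])

# Remove rows where workclass is missing.
complete_rows = without_first.dropna(subset=["workclass"])
```

Each call returns a new DataFrame, so `sample_df` itself is left untouched.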

## **Handling Duplicates**

Duplicate records inflate counts and skew statistics, so identifying and removing them is crucial for maintaining data quality.
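
A minimal sketch on made-up rows. Because an identifier such as `individual_id` differs on every row, it is left out of the comparison here. Whether that is appropriate for the real census table is an assumption.

```python
import pandas as pd

# Hypothetical rows: rows 0 and 2 match on everything except the identifier.
sample_df = pd.DataFrame({
    "individual_id": [1, 2, 3, 4],
    "age": [25, 38, 25, 25],
    "workclass": ["Private", "Private", "Private", "Local-gov"],
    "hours_per_week": [40, 50, 40, 40],
})

# Compare rows on every column except the identifier.
compare_cols = [col for col in sample_df.columns if col != "individual_id"]

# True for each row that repeats an earlier row.
duplicate_mask = sample_df.duplicated(subset=compare_cols)

# Keep the first occurrence of each group of duplicates.
deduplicated_df = sample_df.drop_duplicates(subset=compare_cols, keep="first")
```

`duplicate_mask.sum()` gives the number of rows that would be removed, which is worth checking before dropping anything.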

### **Aggregating Data** (.groupby)

Aggregating data means summarizing data points into meaningful statistics, 
such as averages, sums, or counts, typically by splitting the rows into groups with `.groupby` and computing one value per group.
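
A minimal sketch of a grouped summary on made-up rows. The output column names (`mean_age`, `total_hours`, `n_people`) are illustrative.

```python
import pandas as pd

# Hypothetical rows using column names from the census table.
sample_df = pd.DataFrame({
    "workclass": ["Private", "Private", "Local-gov", "Local-gov"],
    "age": [25, 35, 40, 50],
    "hours_per_week": [40, 50, 30, 40],
})

# Named aggregation: each keyword becomes an output column,
# given as (source column, aggregation function).
summary = sample_df.groupby("workclass").agg(
    mean_age=("age", "mean"),
    total_hours=("hours_per_week", "sum"),
    n_people=("age", "count"),
)
```

The result has one row per `workclass` value, with the group labels as its index.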