## Introduction to Pandas
- **What is Pandas?**
    - A powerful Python library for data manipulation and analysis
    - Built on top of NumPy, providing high-performance data structures
        - Ref: https://www.geeksforgeeks.org/numpy-cheat-sheet/
    - Installing pandas and importing the library

In [1]:
# install
%pip install pandas

Collecting pandas
  Downloading pandas-2.2.2-cp310-cp310-win_amd64.whl (11.6 MB)
Collecting pytz>=2020.1
  Using cached pytz-2024.1-py2.py3-none-any.whl (505 kB)
Collecting tzdata>=2022.7
  Using cached tzdata-2024.1-py2.py3-none-any.whl (345 kB)
Collecting numpy>=1.22.4
  Downloading numpy-2.0.0-cp310-cp310-win_amd64.whl (16.5 MB)

In [2]:
# import pandas and NumPy under their conventional aliases
import pandas as pd
import numpy as np

## Key Data Structures
- Series: 1-dimensional labeled array
- DataFrame: 2-dimensional labeled data structure (like a table)

In [3]:
# Creating a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
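
To make "labeled" concrete, here is a minimal sketch of how the labels on a Series and a DataFrame can be used. The names `labeled` and `table` and the index letters are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A Series is a 1-D array plus an index; the labels here are chosen explicitly
# (illustrative labels, not from the original notebook)
labeled = pd.Series([1, 3, 5, np.nan, 6, 8], index=list("abcdef"))

# A DataFrame is a table whose columns share one row index
table = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

print(labeled["c"])   # label-based lookup on the Series
print(table["A"])     # selecting one column returns a Series
print(table.shape)    # (number of rows, number of columns)
print(labeled.dtype)  # float64, because NaN forces a floating-point dtype
```

When no index is passed, as in the cell above, pandas assigns the default integer labels 0, 1, 2, and so on.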

## Reading and Writing Data
- Reading CSV, Excel, JSON, SQL databases
- Writing data to different formats
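
As a rough sketch of the formats listed above, the round trips below use in-memory buffers and an in-memory SQLite database so that nothing is written to disk. The frame `frame`, the table name `cars`, and the two sample rows are assumptions made for illustration. Excel is left out because `read_excel` and `to_excel` need an extra engine such as `openpyxl` to be installed.

```python
import io
import sqlite3

import pandas as pd

# Small illustrative frame (hypothetical data, not the notebook's data.csv)
frame = pd.DataFrame({"Car": ["Fiat", "Mini"], "Volume": [900, 1500]})

# CSV: write to a string, then read it back from an in-memory buffer
csv_text = frame.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: orient="records" produces one JSON object per row
json_text = frame.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))

# SQL: an in-memory SQLite database through the standard-library driver
conn = sqlite3.connect(":memory:")
frame.to_sql("cars", conn, index=False)
from_sql = pd.read_sql("SELECT * FROM cars", conn)
conn.close()

print(from_csv)
print(from_json)
print(from_sql)
```

Passing `index=False` when writing keeps the row index out of the output, which avoids an extra unnamed column when the data is read back.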

In [23]:
# Reading a CSV file
df = pd.read_csv('data.csv')

In [24]:
print(df)

           Car       Model  Volume  Weight  CO2
0       Toyoty        Aygo    1000   790.0   99
1   Mitsubishi  Space Star    1200  1160.0   95
2        Skoda      Citigo    1000   929.0   95
3         Fiat         500     900   865.0   90
4         Mini      Cooper    1500  1140.0  105
5           VW         Up!    1000   929.0  105
6        Skoda       Fabia    1400  1109.0   90
7     Mercedes     A-Class    1500  1365.0   92
8         Ford      Fiesta    1500  1112.0   98
9         Audi          A1    1600  1150.0   99
10     Hyundai         I20    1100   980.0   99
11      Suzuki       Swift    1300   990.0  101
12        Ford      Fiesta    1000  1112.0   99
13       Honda       Civic    1600     NaN   94
14      Hundai         I30    1600  1326.0   97
15        Opel       Astra    1600  1330.0   97
16         BMW           1    1600  1365.0   99
17       Mazda           3    2200  1280.0  104
18       Skoda       Rapid    1600  1119.0  104
19        Ford       Focus    2000  1328

In [25]:
# you can simply type 'df' and run the cell to display the DataFrame
df

          Car       Model  Volume  Weight  CO2
0      Toyoty        Aygo    1000   790.0   99
1  Mitsubishi  Space Star    1200  1160.0   95
2       Skoda      Citigo    1000   929.0   95
3        Fiat         500     900   865.0   90
4        Mini      Cooper    1500  1140.0  105
5          VW         Up!    1000   929.0  105
6       Skoda       Fabia    1400  1109.0   90
7    Mercedes     A-Class    1500  1365.0   92
8        Ford      Fiesta    1500  1112.0   98
9        Audi          A1    1600  1150.0   99


In [26]:
# Writing to CSV
df.to_csv('output.csv', index=False)

## Data Inspection
- Viewing data: head(), tail(), info(), describe()
- Checking data types and missing values

In [27]:
df.head()

          Car       Model  Volume  Weight  CO2
0      Toyoty        Aygo    1000   790.0   99
1  Mitsubishi  Space Star    1200  1160.0   95
2       Skoda      Citigo    1000   929.0   95
3        Fiat         500     900   865.0   90
4        Mini      Cooper    1500  1140.0  105


In [28]:
df.tail()

         Car   Model  Volume  Weight  CO2
32      Ford   B-Max    1600  1235.0  104
33       BMW     216    1600  1390.0  108
34      Opel  Zafira    1600  1405.0  109
35  Mercedes     SLK    2500  1395.0  120
36      Audi      A6    2000  1725.0  114


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Car     37 non-null     object 
 1   Model   37 non-null     object 
 2   Volume  37 non-null     int64  
 3   Weight  35 non-null     float64
 4   CO2     37 non-null     int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 1.6+ KB


In [30]:
df.describe()

            Volume       Weight         CO2
count         37.0         35.0        37.0
mean   1621.621622  1292.828571  102.351351
std     388.826679   244.443604    7.609264
min          900.0        790.0        90.0
25%         1500.0       1115.5        98.0
50%         1600.0       1330.0        99.0
75%         2000.0       1421.5       105.0
max         2500.0       1725.0       120.0


In [22]:
df.isnull().sum()

Car       0
Model     0
Volume    0
Weight    2
CO2       0
dtype: int64
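
The counts above say how many values are missing but not where they are. A minimal sketch of checking column dtypes and pulling out the affected rows follows; the three-row frame `sample` is an illustrative stand-in built from rows shown in the printed output, not the full `data.csv`.

```python
import numpy as np
import pandas as pd

# Illustrative subset of the car data (three rows only)
sample = pd.DataFrame({
    "Car": ["Toyoty", "Honda", "Opel"],
    "Weight": [790.0, np.nan, 1330.0],
    "CO2": [99, 94, 97],
})

# dtype of every column
print(sample.dtypes)

# Boolean mask of missing Weight values, used to select the affected rows
missing_rows = sample[sample["Weight"].isna()]
print(missing_rows)

# Number of missing values in the Weight column
print(sample["Weight"].isna().sum())
```

`isna()` is an alias of `isnull()`, so either spelling gives the same result.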