## Pandas 

Pandas is an open-source Python library for data manipulation and analysis. It provides data structures for efficiently storing and manipulating large datasets, along with tools for reading and writing data in various formats.

The two primary data structures in pandas are:

DataFrame: A two-dimensional table of data with rows and columns. It is similar to a spreadsheet or SQL table, and you can think of it as a container for Series objects. Each column in a DataFrame is a Series, which is a one-dimensional labeled array.

Series: A one-dimensional labeled array capable of holding any data type. It can be thought of as a single column of a DataFrame.



Pandas provides a wide range of functionalities for data manipulation, including:

Reading and writing data from/to various file formats (CSV, Excel, SQL databases, etc.).
Cleaning and handling missing data.
Filtering, selecting, and indexing data.
Aggregating and summarizing data.
Merging and joining datasets.
Time series analysis.
Plotting and visualization.
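
A few of the operations listed above can be sketched on a small example. The two tables below are hypothetical toy data made up for illustration, not the dataset used later in this notebook.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
people = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Cara', 'Dev'],
    'city': ['Chicago', 'Chicago', 'New York', 'New York'],
    'age': [25, 35, 30, 40],
})
cities = pd.DataFrame({
    'city': ['Chicago', 'New York'],
    'state': ['IL', 'NY'],
})

# Filtering: keep only the rows where age is above 28
older = people[people['age'] > 28]

# Aggregating: mean age per city
mean_age = people.groupby('city')['age'].mean()

# Merging: add the state column by joining the two tables on 'city'
merged = people.merge(cities, on='city', how='left')

print(older)
print(mean_age)
print(merged)
```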

In [63]:
# import the required library
import pandas as pd

In [64]:
# build two Series, then combine them column-wise into a DataFrame
s1 = pd.Series([25, 30, 35, 40], name='age')
s2 = pd.Series(['New York', 'San Francisco', 'Los Angeles', 'Chicago'], name='city')

df = pd.concat([s1, s2], axis=1)
print(df)

   age           city
0   25       New York
1   30  San Francisco
2   35    Los Angeles
3   40        Chicago


Pandas is used to read and write .... using the methods read_csv, to_csv
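
As a minimal sketch of how read_csv and to_csv fit together, the round trip below writes a small DataFrame to CSV text and reads it back. It uses an in-memory buffer instead of a file so that it runs anywhere; the name df_small and its contents are assumptions made for the example.

```python
import io
import pandas as pd

# Hypothetical small table used only for this round trip
df_small = pd.DataFrame({'age': [25, 30], 'city': ['New York', 'Chicago']})

# With no path argument, to_csv returns the CSV text as a string.
# index=False keeps the row index out of the output.
csv_text = df_small.to_csv(index=False)
print(csv_text)

# read_csv accepts a file path or any file-like object
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back)
```

Passing a file path as the first argument to to_csv writes the same text to disk instead of returning it.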

In [82]:
# read data from a CSV file
df = pd.read_csv('./birds (1).csv')
df.head(5)

Unnamed: 0,Name,ScientificName,Category,Order,Family,Genus,ConservationStatus,MinLength,MaxLength,MinBodyMass,MaxBodyMass,MinWingspan,MaxWingspan
0,Black-bellied whistling-duck,Dendrocygna autumnalis,Ducks/Geese/Waterfowl,Anseriformes,Anatidae,Dendrocygna,LC,47.0,56.0,652.0,1020.0,76.0,94.0
1,Fulvous whistling-duck,Dendrocygna bicolor,Ducks/Geese/Waterfowl,Anseriformes,Anatidae,Dendrocygna,LC,45.0,53.0,712.0,1050.0,85.0,93.0
2,Snow goose,Anser caerulescens,Ducks/Geese/Waterfowl,Anseriformes,Anatidae,Anser,LC,64.0,79.0,2050.0,4050.0,135.0,165.0
3,Ross's goose,Anser rossii,Ducks/Geese/Waterfowl,Anseriformes,Anatidae,Anser,LC,57.3,64.0,1066.0,1567.0,113.0,116.0
4,Greater white-fronted goose,Anser albifrons,Ducks/Geese/Waterfowl,Anseriformes,Anatidae,Anser,LC,64.0,81.0,1930.0,3310.0,130.0,165.0


Use the shape attribute to check the size of the DataFrame as (rows, columns)

In [69]:
df.shape

(443, 13)

Use .columns to get the column names of the DataFrame

In [67]:
df.columns

Index(['Name', 'ScientificName', 'Category', 'Order', 'Family', 'Genus',
       'ConservationStatus', 'MinLength', 'MaxLength', 'MinBodyMass',
       'MaxBodyMass', 'MinWingspan', 'MaxWingspan'],
      dtype='object')

Data exploration and analysis: reading, displaying, and understanding the data

In [None]:
# head() shows the first rows and tail() shows the last rows

df.head(5)

df.tail(5)

Get one or more columns

In [68]:
df[['ScientificName', 'Category']]

Unnamed: 0,ScientificName,Category
0,Dendrocygna autumnalis,Ducks/Geese/Waterfowl
1,Dendrocygna bicolor,Ducks/Geese/Waterfowl
2,Anser caerulescens,Ducks/Geese/Waterfowl
3,Anser rossii,Ducks/Geese/Waterfowl
4,Anser albifrons,Ducks/Geese/Waterfowl
...,...,...
438,Passerina caerulea,Cardinals/Allies
439,Passerina amoena,Cardinals/Allies
440,Passerina cyanea,Cardinals/Allies
441,Passerina ciris,Cardinals/Allies
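
The difference between selecting one column and several can be sketched on a three-row table. The rows reuse values shown in the head() output above; the variable names and the length threshold of 60 are assumptions made for the example.

```python
import pandas as pd

# Three rows copied from the head() output above
birds = pd.DataFrame({
    'Name': ['Snow goose', "Ross's goose", 'Fulvous whistling-duck'],
    'Genus': ['Anser', 'Anser', 'Dendrocygna'],
    'MaxLength': [79.0, 64.0, 53.0],
})

# A single label returns a Series
one_col = birds['Name']

# A list of labels returns a DataFrame
two_cols = birds[['Name', 'Genus']]

# A boolean mask keeps only the rows where the condition holds
long_birds = birds[birds['MaxLength'] > 60]

print(type(one_col).__name__, type(two_cols).__name__)
print(long_birds)
```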


Summarize the numeric columns using describe()

In [38]:
# summary statistics: count, mean, standard deviation, min, quartiles, max
df.describe()

Unnamed: 0,age
count,4.0
mean,32.5
std,6.454972
min,25.0
25%,28.75
50%,32.5
75%,36.25
max,40.0


In [74]:
df['MinLength'].mean()

28.53668171557562

In [80]:
df['Family'].unique()

array(['Anatidae', 'Odontophoridae', 'Phasianidae', 'Podicipedidae',
       'Columbidae', 'Cuculidae', 'Caprimulgidae', 'Apodidae',
       'Trochilidae', 'Rallidae', 'Gruidae', 'Recurvirostridae',
       'Charadriidae', 'Scolopacidae', 'Stercorariidae', 'Alcidae',
       'Laridae', 'Gaviidae', 'Procellariidae', 'Ciconiidae',
       'Fregatidae', 'Phalacrocoracidae', 'Pelecanidae', 'Ardeidae',
       'Threskiornithidae', 'Cathartidae', 'Pandionidae', 'Accipitridae',
       'Tytonidae', 'Strigidae', 'Alcedinidae', 'Picidae', 'Falconidae',
       'Tyrannidae', 'Laniidae', 'Vireonidae', 'Corvidae', 'Alaudidae',
       'Hirundinidae', 'Paridae', 'Sittidae', 'Certhiidae',
       'Troglodytidae', 'Polioptilidae', 'Cinclidae', 'Regulidae',
       'Muscicapidae', 'Turdidae', 'Mimidae', 'Sturnidae',
       'Bombycillidae', 'Passeridae', 'Motacillidae', 'Fringillidae',
       'Calcariidae', 'Passerellidae', 'Icteridae', 'Parulidae',
       'Cardinalidae'], dtype=object)

# Missing Values
Identifying and fixing missing values, which pandas represents as NaN (not a number)

To identify missing values, use the .isnull() and .notnull() methods

In [84]:
df.isnull().sum()

Name                  0
ScientificName        0
Category              0
Order                 0
Family                0
Genus                 0
ConservationStatus    0
MinLength             0
MaxLength             0
MinBodyMass           0
MaxBodyMass           0
MinWingspan           0
MaxWingspan           0
dtype: int64
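
The dataset above has no missing values, so every count is zero. A small table with gaps makes the behaviour of these methods easier to see. The sketch below uses hypothetical data with two values removed on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing number and one missing string
toy = pd.DataFrame({
    'MinLength': [47.0, np.nan, 64.0],
    'Genus': ['Dendrocygna', 'Anser', None],
})

# Boolean table: True wherever a value is missing
missing_mask = toy.isnull()

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Single boolean: does the table contain any missing value at all?
has_missing = toy.isnull().any().any()

print(missing_mask)
print(missing_per_column)
print(has_missing)
```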

To get a concise summary of the DataFrame, including the number of non-null values in each column, use .info()

In [85]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 443 entries, 0 to 442
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Name                443 non-null    object 
 1   ScientificName      443 non-null    object 
 2   Category            443 non-null    object 
 3   Order               443 non-null    object 
 4   Family              443 non-null    object 
 5   Genus               443 non-null    object 
 6   ConservationStatus  443 non-null    object 
 7   MinLength           443 non-null    float64
 8   MaxLength           443 non-null    float64
 9   MinBodyMass         443 non-null    float64
 10  MaxBodyMass         443 non-null    float64
 11  MinWingspan         443 non-null    float64
 12  MaxWingspan         443 non-null    float64
dtypes: float64(6), object(7)
memory usage: 45.1+ KB


To count the missing values in each column, use isnull().sum(); to check whether the DataFrame contains any missing value at all, use isnull().any().any()

In [86]:
df.isnull().any().any()

False

# Handling Missing Values

1. Dropping rows and columns

In [None]:
# axis=1 drops every column that contains a missing value; use axis=0 to drop rows instead
df.dropna(axis=1, inplace=True)
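
As a small illustration of the difference between dropping rows and dropping columns, the sketch below applies dropna along both axes to a hypothetical table with a single missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical table: column 'b' has one missing value in row 1
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [4.0, np.nan, 6.0],
})

# axis=0 removes every row that contains a missing value
rows_dropped = toy.dropna(axis=0)

# axis=1 removes every column that contains a missing value
cols_dropped = toy.dropna(axis=1)

print(rows_dropped)
print(cols_dropped)
```

Without inplace=True, dropna returns a new DataFrame and leaves the original unchanged.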

2. Filling the missing values with a specific value, such as the mean, the median, or a constant

In [None]:
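# assign the result back; calling fillna with inplace=True on a selected column may not update df in recent pandas versions
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())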
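
The three fill strategies named above can be compared side by side. This is a minimal sketch on a hypothetical column called mass.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
toy = pd.DataFrame({'mass': [10.0, np.nan, 30.0, 100.0]})

# mean() and median() skip missing values by default
mean_filled = toy['mass'].fillna(toy['mass'].mean())
median_filled = toy['mass'].fillna(toy['mass'].median())

# A constant can be used when a fixed placeholder makes sense
const_filled = toy['mass'].fillna(0)

print(mean_filled)
print(median_filled)
print(const_filled)
```

The mean is pulled up by the large value 100.0, while the median is not, so the median is often the safer choice for skewed columns.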

3. Filling using interpolation, which works best on time series data

In [None]:
df.interpolate(inplace=True)
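
To show what interpolation does, the sketch below fills a two-day gap in a hypothetical daily series. The default method is linear, which treats the values as equally spaced.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with two missing days in the middle
dates = pd.date_range('2024-01-01', periods=4, freq='D')
ts = pd.Series([10.0, np.nan, np.nan, 40.0], index=dates)

# Linear interpolation draws a straight line between the known points
filled = ts.interpolate()

print(filled)
```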

4. Using statistical imputation with the SimpleImputer class from scikit-learn

In [None]:
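from sklearn.impute import SimpleImputer

# the mean strategy only works on numeric data, so select the numeric columns first
num_cols = df.select_dtypes(include='number').columns

imputer = SimpleImputer(strategy='mean')

# replace the missing values in the numeric columns with each column's mean
df[num_cols] = imputer.fit_transform(df[num_cols])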

5. For categorical data, fill missing values with the mode (the most frequent value)
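
A minimal sketch of mode-based filling, assuming a hypothetical categorical column named status whose values are made up for the example:

```python
import pandas as pd

# Hypothetical categorical column with one missing entry
toy = pd.DataFrame({'status': ['LC', 'LC', None, 'NT']})

# mode() returns a Series because ties are possible; take the first entry
most_common = toy['status'].mode()[0]

# Fill the gap with the most frequent category
toy['status'] = toy['status'].fillna(most_common)

print(toy)
```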