## <span style="color:green">Introduction to Pandas
    
Pandas is a Python library used for data manipulation and analysis. Pandas provides a convenient way to analyze and clean data.

The Pandas library introduces two new data structures to Python - Series and DataFrame, both of which are built on top of NumPy.
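
As a minimal sketch of these two structures (the labels and values below are made up for illustration), a Series holds one labeled column of values, a DataFrame holds several, and the underlying values can be exposed as a NumPy array:

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array.
prices = pd.Series([20000, 55000, 22000], index=["Corolla", "Mustang", "Civic"])

# A DataFrame is a two-dimensional table whose columns are Series.
cars = pd.DataFrame({
    "Model": ["Corolla", "Mustang", "Civic"],
    "Horsepower": [139, 450, 158],
})

# Selecting one column of a DataFrame gives a Series ...
horsepower = cars["Horsepower"]
print(type(horsepower))

# ... and its values can be exposed as a NumPy array.
hp_array = horsepower.to_numpy()
print(type(hp_array), hp_array)
```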

### <span style="color:blue">What Can Pandas Do?
Pandas helps you answer questions about your data, such as:

- Is there a correlation between two or more columns?
- What is the average value?
- What is the maximum value?
- What is the minimum value?

Pandas can also delete rows that are irrelevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
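
The questions listed above map onto a handful of method calls. The sketch below uses a small, hypothetical DataFrame (the numbers are invented purely for illustration):

```python
import pandas as pd

# Hypothetical sample values, used only for illustration.
sample = pd.DataFrame({
    "Price ($)": [20000, 55000, 22000, 41000],
    "Horsepower": [139, 450, 158, 255],
})

# Correlation between two columns (Pearson by default).
correlation = sample["Price ($)"].corr(sample["Horsepower"])

# Average, maximum, and minimum of a single column.
average_price = sample["Price ($)"].mean()
max_price = sample["Price ($)"].max()
min_price = sample["Price ($)"].min()

print(correlation, average_price, max_price, min_price)
```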



### <span style="color:blue">What is Pandas Used for?
Pandas is a powerful library generally used for:
    
1. Data Cleaning
2. Data Transformation
3. Data Analysis
4. Machine Learning
5. Data Visualization

### <span style="color:blue">Installation of Pandas
    
If Python and pip are already installed on your system, installing Pandas is straightforward.

Install it using this command:

In [1]:
!pip install pandas





### <span style="color:blue">Importing Libraries and Aliasing
    
Libraries in Python provide additional functionality. We import them using the import keyword and can give them aliases for convenience.

In [None]:
import pandas as pd  # Importing the pandas library and aliasing it as 'pd'
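
A quick way to confirm that the installation and the import both worked is to print the library version through the alias:

```python
import pandas as pd

# If this prints a version string, pandas is installed and importable.
print(pd.__version__)
```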


### <span style="color:blue">Reading CSV Files
    
read_csv() is used to read a comma-separated values (CSV) file into a DataFrame.
    

In [None]:
data = pd.read_csv("car_information.csv")
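
If the CSV file is not at hand, read_csv() can be tried on an in-memory string instead, since it accepts file-like objects as well as paths. The CSV text below is a hypothetical stand-in, not the contents of car_information.csv:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Car Model,Manufacturer,Year
Model S,Tesla,2021
Corolla,Toyota,2020
"""

# The first line becomes the column names; column types are inferred.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
print(sample.dtypes)
```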


### <span style="color:blue">DataFrames and Creation
    
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.



In [None]:
data = pd.DataFrame({
    "Car Model": ["Model S", "Corolla", "Mustang", "Civic", "Camry", "3 Series", "A4", "Accord", "Model 3", "Altima", "C-Class", "Malibu", "Impreza", "Charger", "Escape", "Highlander", "Cherokee", "Outback", "CX-5", "Q5", "GLC", "X3", "Rogue", "Tucson", "Forester"],
    "Manufacturer": ["Tesla", "Toyota", "Ford", "Honda", "Toyota", "BMW", "Audi", "Honda", "Tesla", "Nissan", "Mercedes-Benz", "Chevrolet", "Subaru", "Dodge", "Ford", "Toyota", "Jeep", "Subaru", "Mazda", "Audi", "Mercedes-Benz", "BMW", "Nissan", "Hyundai", "Subaru"],
    "Year": [2021, 2020, 2019, 2018, 2021, 2020, 2019, 2018, 2021, 2020, 2020, 2019, 2018, 2021, 2020, 2021, 2019, 2020, 2021, 2021, 2020, 2019, 2020, 2021, 2020],
    "Price ($)": [79990, 20000, 55000, 22000, 25000, 41000, 39000, 24000, 39990, 24000, 41000, 22000, 19000, 29995, 25000, 35000, 27000, 27000, 26000, 44000, 43000, 42000, 27000, 25000, 25000],
    "Horsepower": [670, 139, 450, 158, 203, 255, 248, 192, 283, 188, 255, 160, 152, 292, 180, 295, 180, 182, 187, 261, 255, 248, 170, 187, 182]
})


### <span style="color:blue">Indexing
    
Indexing allows for accessing specific rows and columns of the DataFrame.

In [None]:
print(data["Car Model"])  # Accessing the 'Car Model' column


### <span style="color:blue">Locate Row
    
The loc indexer (used with square brackets rather than called like a function) accesses a group of rows and columns by labels or a boolean array.

In [None]:
row = data.loc[2]  # Accessing the row labeled 2 (the third row with the default index)
print(row)
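
With the default integer index, labels and positions happen to coincide, which can hide what loc actually does. The sketch below uses a hypothetical DataFrame with string labels to show label-based access, a boolean array, and, for contrast, position-based access with iloc:

```python
import pandas as pd

# Hypothetical data with string labels as the index.
cars = pd.DataFrame(
    {"Year": [2021, 2020, 2019], "Horsepower": [670, 139, 450]},
    index=["Model S", "Corolla", "Mustang"],
)

# Label-based: the row whose index label is "Corolla".
by_label = cars.loc["Corolla"]

# Label-based with a column name: a single cell.
mustang_hp = cars.loc["Mustang", "Horsepower"]

# Boolean array: only the rows where the condition is True.
recent = cars.loc[cars["Year"] >= 2020]

# Position-based: the third row, whatever its label is.
by_position = cars.iloc[2]

print(by_label, mustang_hp, recent, by_position, sep="\n\n")
```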


### <span style="color:blue">Head
    
The head() function returns the first n rows of the DataFrame (5 by default).

In [None]:
print(data.head())  # Displaying the first 5 rows


<span style="color:red">Print the first 10 rows of the DataFrame

In [None]:
print(data.head(10))

### <span style="color:blue">Tail
The tail() function returns the last n rows of the DataFrame (5 by default).

In [None]:
print(data.tail())  # Displaying the last 5 rows


<span style="color:red">Print the last 10 rows of the DataFrame

In [None]:
print(data.tail(10))

### <span style="color:blue">Info
The info() function prints a concise summary of the DataFrame.

In [None]:
data.info()  # info() prints the summary itself and returns None, so print() is not needed


### <span style="color:blue">Remove Rows - dropna()
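The dropna() function removes rows that contain missing values (by default, any row with at least one missing value). It returns a new DataFrame and leaves the original unchanged.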

In [None]:
data_cleaned = data.dropna()  # Removing rows with any missing values
print(data_cleaned)
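
The cars DataFrame defined earlier has no missing values, so the call above returns it unchanged. The sketch below uses a hypothetical DataFrame with gaps to show the default behavior and the subset parameter:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps, used only for illustration.
sample = pd.DataFrame({
    "Car Model": ["Civic", "Camry", "A4"],
    "Price ($)": [22000, np.nan, 39000],
    "Horsepower": [158, 203, np.nan],
})

# Default: drop every row that has at least one missing value.
complete_rows = sample.dropna()

# Only look at the price column when deciding which rows to drop.
priced_rows = sample.dropna(subset=["Price ($)"])

print(complete_rows)
print(priced_rows)
```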


### <span style="color:blue">Fill Null - fillna()
The fillna() function replaces missing values with a specified value.

In [None]:
data_filled = data.fillna(0)  # Filling missing values with 0
print(data_filled)
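
Filling every gap with 0 is not always appropriate, for example for prices. As a hedged alternative, fillna() also accepts a dict that maps each column to its own fill value. The data below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per column.
sample = pd.DataFrame({
    "Price ($)": [22000.0, np.nan, 39000.0],
    "Horsepower": [158.0, 203.0, np.nan],
})

# Fill each column with a statistic computed from its own valid values.
filled = sample.fillna({
    "Price ($)": sample["Price ($)"].mean(),
    "Horsepower": sample["Horsepower"].median(),
})
print(filled)
```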


### <span style="color:blue">Mean
The mean() function calculates the mean of the numeric columns.

In [None]:
print(data["Price ($)"].mean())  # Calculating the mean price


### <span style="color:blue">Median
The median() function calculates the median of the numeric columns.

In [None]:
print(data["Horsepower"].median())  # Calculating the median horsepower


### <span style="color:blue">Mode
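The mode() function returns the most frequent value(s) in a column. Because several values can tie for most frequent, the result is a Series rather than a single number.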

In [None]:
print(data["Year"].mode())  # Calculating the mode year
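
A small sketch with made-up values shows why mode() returns a Series: when two values tie, both are returned.

```python
import pandas as pd

# Hypothetical values in which 2020 and 2021 each appear twice.
years = pd.Series([2019, 2020, 2020, 2021, 2021])

# Both tied values are returned, in sorted order.
modes = years.mode()
print(modes.tolist())

# Take the first entry when a single value is needed.
first_mode = modes.iloc[0]
print(first_mode)
```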


### <span style="color:blue">Fixing Wrong Data
We can fix wrong data by assigning correct values to specific cells.

In [None]:
data.loc[data["Car Model"] == "Model S", "Price ($)"] = 79999  # Fixing the price of Model S
print(data.loc[data["Car Model"] == "Model S"])


### <span style="color:blue">Remove Duplicate - drop_duplicates()
The drop_duplicates() function removes duplicate rows.

In [None]:
data_unique = data.drop_duplicates()
print(data_unique)
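
By default, two rows count as duplicates only when every column matches. The sketch below, on hypothetical data, contrasts that default with the subset and keep parameters:

```python
import pandas as pd

# Hypothetical data: the first two rows are identical.
sample = pd.DataFrame({
    "Car Model": ["Civic", "Civic", "Civic", "Camry"],
    "Year": [2018, 2018, 2019, 2021],
})

# Rows are dropped only if they match an earlier row in every column.
unique_rows = sample.drop_duplicates()

# Compare on one column only, keeping the last occurrence of each model.
latest_per_model = sample.drop_duplicates(subset=["Car Model"], keep="last")

print(unique_rows)
print(latest_per_model)
```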


### <span style="color:blue">Querying a DataFrame in Python

#### <span style="color:red">1. Select all cars from 2021:

In [None]:
cars_2021 = data[data['Year'] == 2021]
print("\nCars from 2021:")
print(cars_2021)


#### <span style="color:red">2. Select cars with a price greater than $30,000:

In [None]:
expensive_cars = data[data['Price ($)'] > 30000]
print("\nCars with a price greater than $30,000:")
print(expensive_cars)


#### <span style="color:red">3. Select cars by a specific manufacturer (e.g., Tesla):

In [None]:
tesla_cars = data[data['Manufacturer'] == 'Tesla']
print("\nTesla cars:")
print(tesla_cars)


#### <span style="color:red">4. Select cars with horsepower greater than 200:

In [None]:
high_hp_cars = data[data['Horsepower'] > 200]
print("\nCars with horsepower greater than 200:")
print(high_hp_cars)


#### <span style="color:red">5. Select specific columns (e.g., Car Model and Price) for cars from 2020:

In [None]:
cars_2020_specific_columns = data.loc[data['Year'] == 2020, ['Car Model', 'Price ($)']]  # Filter rows and pick columns in one step
print("\nCar Model and Price for cars from 2020:")
print(cars_2020_specific_columns)


#### <span style="color:red">6. Count the number of cars by manufacturer:

In [None]:
manufacturer_counts = data['Manufacturer'].value_counts()
print("\nNumber of cars by manufacturer:")
print(manufacturer_counts)


#### <span style="color:red">7. Average price of cars by year:

In [None]:
average_price_by_year = data.groupby('Year')['Price ($)'].mean()
print("\nAverage price of cars by year:")
print(average_price_by_year)


#### <span style="color:red">8. Maximum horsepower of cars by manufacturer:

In [None]:
max_hp_by_manufacturer = data.groupby('Manufacturer')['Horsepower'].max()
print("\nMaximum horsepower of cars by manufacturer:")
print(max_hp_by_manufacturer)
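
The queries above each test a single condition. As a sketch of how conditions can be combined, the example below uses a small, hypothetical DataFrame. The price column is named "Price" here, without the "($)" suffix, only so that it can be referenced directly inside a query() string:

```python
import pandas as pd

# Hypothetical subset of car data, for illustration only.
cars = pd.DataFrame({
    "Car Model": ["Model S", "Corolla", "Model 3", "Q5"],
    "Manufacturer": ["Tesla", "Toyota", "Tesla", "Audi"],
    "Year": [2021, 2020, 2021, 2021],
    "Price": [79990, 20000, 39990, 44000],
})

# Combine conditions with & (and) or | (or); each one needs its own parentheses.
affordable_2021 = cars[(cars["Year"] == 2021) & (cars["Price"] < 50000)]

# The same filter written as a query string.
same_result = cars.query("Year == 2021 and Price < 50000")

# isin() matches any value from a list.
tesla_or_audi = cars[cars["Manufacturer"].isin(["Tesla", "Audi"])]

print(affordable_2021)
print(tesla_or_audi)
```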


### <span style="color:blue">PRACTICE QUESTIONS

## <span style="color:blue">Libraries: Import and Alias

Libraries in Python are collections of reusable modules that a program can import and use. The pandas library is widely used for data manipulation and analysis. It's often imported with the alias pd.

In [None]:
import pandas as pd


## <span style="color:blue">Reading CSV Data

The read_csv() function in pandas is used to read a comma-separated values (CSV) file into a DataFrame.

In [None]:
df = pd.read_csv("uncleaned_city_data.csv")
print(df)


## <span style="color:blue">DataFrames and Creation

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It’s similar to a table in a database or a data frame in R.

In [None]:
data = {
    "City": ["Tokyo", "New York", "Los Angeles", "London", "Paris"],
    "Country": ["Japan", "USA", "USA", "UK", "France"],
    "Population (millions)": [37.4, 8.4, 4, 9, 2.1],
    "Area (sq km)": [2191, 783.8, 1302, 1572, 105.4],
    "GDP per Capita ($)": [40000, 70000, 65000, 55000, 60000]
}
df_example = pd.DataFrame(data)
print(df_example)


## <span style="color:blue">Indexing

Indexing in pandas means selecting particular rows and columns of data from a DataFrame. This can be done using labels or positional indexing.

In [None]:
# Select the 'City' column
city_column = df['City']
print(city_column)

# Select the first row
first_row = df.iloc[0]
print(first_row)


## <span style="color:blue">Locate Row
The loc indexer is used to access a group of rows and columns by labels or a boolean array.

In [None]:
# Locate rows where the country is 'USA'
usa_cities = df.loc[df['Country'] == 'USA']
print(usa_cities)


## <span style="color:blue">Head

The head() function returns the first n rows of a DataFrame (5 by default).

In [None]:
print(df.head())


<span style="color:red">Print the first 10 rows of the DataFrame

In [None]:
print(df.head(10))


## <span style="color:blue">Tail
The tail() function returns the last n rows of a DataFrame (5 by default).

In [None]:
print(df.tail())


<span style="color:red">Print the last 10 rows of the DataFrame

In [None]:
print(df.tail(10))

## <span style="color:blue">Info
The info() function provides a summary of the DataFrame including the data types and non-null values.

In [None]:
df.info()  # info() prints the summary itself and returns None, so print() is not needed


## <span style="color:blue">Remove Rows - dropna()
The dropna() function removes rows that contain missing values from the DataFrame and returns the result as a new DataFrame.

In [None]:
df_cleaned = df.dropna()
print(df_cleaned)


## <span style="color:blue">Fill Null - fillna()
The fillna() function replaces missing values with a specified value.

In [None]:
df_filled = df.fillna(0)
print(df_filled)


## <span style="color:blue">Mean
The mean() function calculates the average of numerical columns in the DataFrame.

In [None]:
mean_population = df['Population (millions)'].mean()
print(f"Mean Population: {mean_population}")


## <span style="color:blue">Median
The median() function calculates the median of numerical columns in the DataFrame.

In [None]:
median_population = df['Population (millions)'].median()
print(f"Median Population: {median_population}")


## <span style="color:blue">Mode
The mode() function returns the most frequent value(s) of a column as a Series, since more than one value can tie for most frequent.

In [None]:
mode_population = df['Population (millions)'].mode()
print(f"Mode Population: {mode_population.tolist()}")  # tolist() shows only the values, not the Series index


## <span style="color:blue">Fixing Wrong Data
Fixing wrong data involves identifying and correcting errors in the dataset. This can be done by applying conditional statements.

In [None]:
# Replace negative values in the 'GDP per Capita ($)' column with the column mean
mean_gdp = df['GDP per Capita ($)'].mean()
df['GDP per Capita ($)'] = df['GDP per Capita ($)'].apply(lambda x: mean_gdp if x < 0 else x)
print(df)
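
One caveat with the cell above: the mean is computed while the negative values are still in the column, so the replacement value is itself skewed by the bad data. A possible alternative, sketched here on a hypothetical three-row DataFrame, averages only the valid entries and uses mask() to replace the invalid ones:

```python
import pandas as pd

# Hypothetical data with one invalid (negative) entry.
cities = pd.DataFrame({
    "City": ["Tokyo", "Paris", "London"],
    "GDP per Capita ($)": [40000, -1, 55000],
})

col = "GDP per Capita ($)"
invalid = cities[col] < 0

# Average only the valid entries so the bad value does not skew the result.
valid_mean = cities.loc[~invalid, col].mean()

# mask() replaces values where the condition is True and keeps the rest.
cities[col] = cities[col].mask(invalid, valid_mean)
print(cities)
```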


## <span style="color:blue">Remove Duplicate - drop_duplicates()
The drop_duplicates() function removes duplicate rows from the DataFrame.

In [None]:
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)


## <span style="color:blue">Querying a DataFrame Using Pandas

#### <span style=color:red>1. Select all cities in Japan:

In [None]:
cities_in_japan = df[df['Country'] == 'Japan']
print("\nCities in Japan:")
print(cities_in_japan)


#### <span style=color:red>2. Select cities with a population greater than 10 million:

In [None]:
large_cities = df[df['Population (millions)'] > 10]
print("\nCities with a population greater than 10 million:")
print(large_cities)


#### <span style=color:red>3. Select cities with a GDP per Capita greater than $50,000:

In [None]:
high_gdp_cities = df[df['GDP per Capita ($)'] > 50000]
print("\nCities with GDP per Capita greater than $50,000:")
print(high_gdp_cities)


#### <span style=color:red>4. Select specific columns (e.g., City and Country) for cities in the USA:

In [None]:
usa_cities_specific_columns = df.loc[df['Country'] == 'USA', ['City', 'Country']]  # Filter rows and pick columns in one step
print("\nCity and Country for cities in the USA:")
print(usa_cities_specific_columns)


#### <span style=color:red>5. Average population of cities by country:

In [None]:
average_population_by_country = df.groupby('Country')['Population (millions)'].mean()
print("\nAverage population of cities by country:")
print(average_population_by_country)


#### <span style=color:red>6. Count the number of cities by country:

In [None]:
city_counts_by_country = df['Country'].value_counts()
print("\nNumber of cities by country:")
print(city_counts_by_country)


#### <span style=color:red>7. Maximum GDP per Capita of cities by country:

In [None]:
max_gdp_by_country = df.groupby('Country')['GDP per Capita ($)'].max()
print("\nMaximum GDP per Capita of cities by country:")
print(max_gdp_by_country)


# <span style="color:blue">----END OF THE SESSION------

## <span style="color:RED"> ASSIGNMENT

<span style="color:red">1. How do you import pandas and assign an alias to it?


<span style="color:red">2. How do you read a CSV file into a DataFrame using pandas?

<span style="color:red">3. Create a DataFrame in pandas.

<span style="color:red">4. Display the first 15 rows of a DataFrame.

<span style="color:red">5. Display the last 3 rows of a DataFrame.

<span style="color:red">6. Remove duplicate rows from the DataFrame.

<span style="color:red">7. Find the data type of each column.

<span style="color:red">8. List all animals whose lifespan is greater than a given age.

<span style="color:red">9. Prompt the user for a habitat (e.g., Forest, Desert, Ocean), then display information about the animals that live in that habitat.

<span style="color:red">10. Select all endangered animals.

<span style="color:red">11. Count the number of animals by type.

<span style="color:red">12. Find the average lifespan of animals by habitat.

<span style="color:red">13. Find the maximum lifespan of animals by diet.