# <div style="text-align: center;">Data Visualization</div>

# Visualization
Visualization is a crucial aspect of data analysis and interpretation, as it allows for easy comprehension of complex data sets. It helps in identifying patterns, relationships, and trends that might not be apparent through raw data alone. It is also used in research publications to present results.

Visualization libraries in Python enable users to create intuitive and interactive data visualizations that can effectively communicate insights to a broad audience. Some of the popular visualization libraries and frameworks in Python include Matplotlib, Plotly, Bokeh, and Seaborn. Each of these libraries has its own unique features and capabilities that cater to specific needs. 

# Matplotlib
Matplotlib is very flexible and customizable for creating plots. However, it can require a lot of code, even for basic plots with little customization. When exploratory data analysis is the main goal, and many plots must be drawn quickly without much emphasis on aesthetics, Seaborn is a great option because it builds on top of Matplotlib to create visualizations more quickly.

# Seaborn
Built on top of Matplotlib, Seaborn is a well-known Python library for data visualization that offers a user-friendly interface for producing visually appealing and informative statistical graphics. **It is designed to work with Pandas dataframes**, making it easy to visualize and explore data quickly and effectively.

Seaborn offers a variety of plot types, including scatter plots, line plots, bar plots, heat maps, and many more. It also supports statistical visualization, such as regression plots, distribution plots, and categorical plots.

Seaborn's key benefit is that it generates attractive plots with minimal code. It provides a range of default themes and color palettes, which you can easily customize to suit your preferences. Seaborn also performs common statistical estimation as part of plotting (for example, aggregating values, drawing confidence intervals, or fitting a regression line), so these summaries appear in the figure without separate analysis code.

Another notable feature of Seaborn is its ability to create complex multi-plot visualizations. With Seaborn, users can create grids of plots that allow for easy comparison between multiple variables or subsets of data. This makes it an ideal tool for exploratory data analysis and presentation.

Seaborn is a powerful and flexible data visualization library in Python that offers an easy-to-use interface for creating informative and aesthetically pleasing statistical graphics. It provides a range of tools for visualizing data, including advanced statistical analysis, and makes it easy to create complex multi-plot visualizations.

# Seaborn vs. Matplotlib
Matplotlib and Seaborn are two of the most widely used data visualization libraries in Python. While both libraries are designed to create high-quality graphics and visualizations, they have several key differences that make them better suited for different use cases.

One of the main differences between Matplotlib and Seaborn is their focus. Matplotlib is a low-level plotting library that provides a wide range of tools for creating highly customizable visualizations. It is a highly flexible library, allowing users to create almost any type of plot they can imagine. This flexibility comes at the cost of a steeper learning curve and more verbose code.

Seaborn, on the other hand, is a high-level interface for creating statistical graphics. It is built on top of Matplotlib and provides a simpler, more intuitive interface for creating common statistical plots. Seaborn is designed to work with Pandas dataframes, making it easy to create visualizations with minimal code. Its plotting functions also compute common statistical summaries, such as means, confidence intervals, and regression fits, as part of drawing the figure.

Another key difference between Matplotlib and Seaborn is their default styles and color palettes. Matplotlib provides a limited set of default styles and color palettes, requiring users to customize their plots manually to achieve a desired look. Seaborn, on the other hand, offers a range of default styles and color palettes that are optimized for different types of data and visualizations. This makes it easy for users to create visually appealing plots with minimal customization.

While both libraries have their strengths and weaknesses, Seaborn is generally better suited for creating statistical graphics and exploratory data analysis, while Matplotlib is better suited for creating highly customizable plots for presentations and publications. However, it is worth noting that Seaborn is built on top of Matplotlib, and the two libraries can be used together to create complex, highly customizable visualizations that leverage the strengths of both libraries.

Matplotlib and Seaborn are both powerful data visualization libraries in Python, with different strengths and weaknesses. Understanding the differences between the two libraries can help users choose the right tool for their specific data visualization needs.

## Matplotlib

### ✅ Visualising by defining a dataset:

### ✅ Step 1: Install and Import Required Libraries

In [None]:
import matplotlib.pyplot as plt

### ✅ Step 2: Define the Dataset

In [None]:
# Time in hours
time = [0, 1, 2, 3, 4, 5, 6, 7, 8]

# Rainfall Hyetograph (Rainfall Intensity in mm/hr)
rainfall = [0, 5, 15, 25, 10, 5, 2, 0, 0]

# Runoff Hydrograph (Discharge in m³/s)
runoff = [0, 2, 10, 30, 45, 35, 20, 8, 2]


In [None]:
plt.plot(time, rainfall)
# plt.show()  # optional in a notebook; needed when running as a script

In [None]:
plt.plot(time, rainfall)
plt.plot(time, runoff)
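
Rainfall (mm/hr) and runoff (m³/s) are measured in different units, so one possible refinement is to give each series its own y-axis. The sketch below reuses the values defined above; the bar-plus-line layout, colors, and variable names such as `ax_rain` and `ax_flow` are illustrative choices, not part of the original notebook.

```python
import matplotlib.pyplot as plt

# Same values as the dataset defined above
time = [0, 1, 2, 3, 4, 5, 6, 7, 8]          # hours
rainfall = [0, 5, 15, 25, 10, 5, 2, 0, 0]   # mm/hr
runoff = [0, 2, 10, 30, 45, 35, 20, 8, 2]   # m³/s

fig, ax_rain = plt.subplots()

# Rainfall as bars on the left y-axis
ax_rain.bar(time, rainfall, color="skyblue", label="Rainfall (mm/hr)")
ax_rain.set_xlabel("Time (hours)")
ax_rain.set_ylabel("Rainfall intensity (mm/hr)")

# Runoff as a line on a second y-axis that shares the same x-axis
ax_flow = ax_rain.twinx()
ax_flow.plot(time, runoff, color="navy", marker="o", label="Runoff (m³/s)")
ax_flow.set_ylabel("Discharge (m³/s)")

# Find the hour at which each series peaks and report it in the title
peak_rain_hour = time[rainfall.index(max(rainfall))]
peak_flow_hour = time[runoff.index(max(runoff))]
ax_rain.set_title(
    f"Rainfall peaks at hour {peak_rain_hour}, runoff at hour {peak_flow_hour}"
)
```

With these values the runoff peak lags the rainfall peak by one hour, which is easier to see when the two series do not share a scale.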

### ✅ Visualisation using a DataFrame or an existing dataset (CSV or XLS file):

**Data cleaning** is the process of identifying and correcting (or removing) errors and inconsistencies in data to improve its quality. It's an essential step in the data analysis pipeline because raw data is often incomplete, incorrect, duplicated, or formatted inconsistently.

**Key Tasks in Data Cleaning:**
* Handling missing data: Filling in or removing missing values.

* Correcting errors: Fixing typos, inconsistent capitalization, or incorrect entries.

* Removing duplicates: Deleting repeated rows or values.

* Standardizing formats: Ensuring consistency in date formats, units of measure, etc.

* Filtering irrelevant data: Removing data that doesn't add value or is out of scope.
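
The tasks above can be sketched on a tiny table. The records below are invented for illustration (they are not taken from `AQI_data.csv`), and filling gaps with the column mean is just one possible strategy.

```python
import pandas as pd

# Hypothetical raw records, invented for illustration only
raw = pd.DataFrame({
    "city": ["Dhaka", "dhaka ", "Gazipur", "Gazipur", None],
    "date": ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-01", "2024-01-03"],
    "aqi": ["150", "n/a", "120", "120", "95"],
    "notes": ["", "", "", "", ""],
})

clean = raw.drop(columns=["notes"])                          # filter irrelevant data
clean["city"] = clean["city"].str.strip().str.title()        # correct inconsistent text
clean["date"] = pd.to_datetime(clean["date"])                # standardize formats
clean["aqi"] = pd.to_numeric(clean["aqi"], errors="coerce")  # bad entries become NaN
clean = clean.drop_duplicates()                              # remove duplicate rows
clean = clean.dropna(subset=["city"])                        # drop rows missing a key field
clean["aqi"] = clean["aqi"].fillna(clean["aqi"].mean())      # fill remaining gaps

print(clean)
```

Three rows remain: the duplicate Gazipur row and the row without a city are removed, and the unreadable `n/a` reading is replaced by the mean of the other two values.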

**Before visualizing our data, we have to clean it.**

# <div style="text-align: center;">Data Cleaning</div>

In [None]:
import numpy as np
import pandas as pd

In [None]:
df=pd.read_csv('AQI_data.csv')
df

In [None]:
df.head(5)

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
#delete unnecessary columns or column with excessive null values

In [None]:
df.drop(columns=['Pollution Category2','Pollution Category3'])

In [None]:
df

In [None]:
# The drop above is not permanent: df keeps the same shape (268 x 27). Pass inplace=True, or assign the result to a variable, to keep the change.
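
The behaviour of `drop()` can be checked on a small stand-in frame. The column names `Extra1` and `Extra2` are hypothetical placeholders, not columns from the AQI file.

```python
import pandas as pd

# Hypothetical frame standing in for the AQI data
demo = pd.DataFrame({"Dhaka": [150, 160], "Extra1": [None, None], "Extra2": [None, None]})

demo.drop(columns=["Extra1"])                # result is discarded; demo is unchanged
print(demo.shape)                            # (2, 3)

trimmed = demo.drop(columns=["Extra1"])      # option 1: keep the returned copy
print(trimmed.shape)                         # (2, 2)

demo.drop(columns=["Extra2"], inplace=True)  # option 2: modify demo itself
print(demo.shape)                            # (2, 2)
```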

In [None]:
df

In [None]:
df.info()

In [None]:
df = df[['Date','Dhaka','Chattogram','Gazipur']].copy()  # .copy() avoids a SettingWithCopyWarning when columns are modified later

In [None]:
df.info()

In [None]:
df['Date']

In [None]:
df['Dhaka']

In [None]:
df['Chattogram']

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df

In [None]:
# Convert non-numeric values in 'Gazipur' column to NaN, then to numeric type

In [None]:
df['Gazipur'] = pd.to_numeric(df['Gazipur'], errors='coerce')  # In Jupyter, press Shift+Tab inside the parentheses to view the function's documentation
df
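
To see what `errors='coerce'` does, here is a minimal sketch on a hypothetical column in which one reading was stored as text (the value `"DNA"` is an invented placeholder).

```python
import pandas as pd

# Hypothetical column where one reading is not a number
readings = pd.Series(["120", "98.5", "DNA", "143"])

numeric = pd.to_numeric(readings, errors="coerce")  # unparseable text becomes NaN
print(numeric.isna().sum())  # 1
print(numeric.mean())        # 120.5 (NaN is skipped by default)
```

Without `errors='coerce'`, the same call raises an error at the first value it cannot parse.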

In [None]:
df.info()

In [None]:
df['Date'] = pd.to_datetime(df['Date'])

In [None]:
df['Date']

In [None]:
df.info()

In [None]:
df.head(10)

In [None]:
df.describe()

In [None]:
df.shape

## Handling Missing Values

### Option 1: Dropping or deleting missing data(row or column) using dropna()

In [None]:
df.dropna()  # defaults: axis=0, how='any', inplace=False

In [None]:
df.dropna().shape

In [None]:
df.dropna(axis=0) #default,drop rows

In [None]:
df.dropna(axis=1) #drop columns

In [None]:
df.dropna(how='all') #drop rows having all null values in all columns

In [None]:
df.dropna(how='any') #default,drop rows having any null values in any of the columns 

In [None]:
df.shape

In [None]:
df.dropna(inplace=False) #default,does not modify the original DataFrame
df

In [None]:
df.shape

In [None]:
df.dropna(inplace=True) #Modifies the original DataFrame
df

In [None]:
df.shape
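
The effect of each `dropna()` option is easier to verify on a frame small enough to read at a glance. The values below are hypothetical, and `thresh` is an additional `dropna()` parameter not used in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 has one gap, row 2 is entirely empty
demo = pd.DataFrame({
    "Dhaka":      [150.0, 160.0, np.nan],
    "Chattogram": [110.0, np.nan, np.nan],
})

print(demo.dropna().shape)           # (1, 2): only row 0 is complete
print(demo.dropna(how="all").shape)  # (2, 2): only the all-empty row is dropped
print(demo.dropna(axis=1).shape)     # (3, 0): every column contains a gap
print(demo.dropna(thresh=1).shape)   # (2, 2): keep rows with at least 1 value
```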

### Option 2: Filling missing data(row or column) using fillna()

In [None]:
df=pd.read_csv('AQI_data.csv')
df

In [None]:
df.isnull().sum()

In [None]:
#fill all the missing values of the df with a particular value or string

In [None]:
df.fillna(0)

In [None]:
df.fillna(0).isnull().sum()

In [None]:
df['Pollution Category2'].fillna('not specified')

In [None]:
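df['Pollution Category2'] = df['Pollution Category2'].fillna('not specified')  # assign back; fillna(..., inplace=True) on a selected column is deprecated in recent pandas and may not update df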

In [None]:
df['Pollution Category2']

In [None]:
df

In [None]:
df['Pollution Category9'].head(10)

In [None]:
df['Pollution Category9'].bfill()  # backward fill; fillna(method='bfill') is deprecated in recent pandas

In [None]:
df['Pollution Category9'].ffill()  # forward fill; replaces fillna(method='ffill')
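
Forward fill copies the last valid value downward, while backward fill copies the next valid value upward. A minimal sketch with `Series.ffill()` and `Series.bfill()` on hypothetical category labels:

```python
import pandas as pd

# Hypothetical labels with gaps
s = pd.Series([None, "Good", None, None, "Unhealthy", None])

forward = s.ffill()   # gaps take the previous valid value
backward = s.bfill()  # gaps take the next valid value

print(forward.tolist()[1:])   # ['Good', 'Good', 'Good', 'Unhealthy', 'Unhealthy']
print(backward.tolist()[:5])  # ['Good', 'Good', 'Unhealthy', 'Unhealthy', 'Unhealthy']
```

Note that a gap at the very start survives a forward fill, and a gap at the very end survives a backward fill, because there is no neighbouring value to copy.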

In [None]:
df = df[['Date','Dhaka','Chattogram','Gazipur']].copy()  # .copy() avoids a SettingWithCopyWarning when columns are modified later
df

### Sorting

In [None]:
df = df.sort_values('Date')
df
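
If the `Date` column still holds text, as it typically does right after `read_csv` without date parsing, `sort_values` orders it alphabetically rather than chronologically. The sketch below shows the difference on hypothetical day/month/year strings; the date format is an assumption made for this example only.

```python
import pandas as pd

# Hypothetical day/month/year strings
dates = pd.DataFrame({"Date": ["2/1/2024", "10/1/2024", "1/1/2024"]})

# Sorting text compares character by character
as_text = dates.sort_values("Date")["Date"].tolist()
print(as_text)   # ['1/1/2024', '10/1/2024', '2/1/2024']

# Sorting real datetimes gives chronological order
dates["Date"] = pd.to_datetime(dates["Date"], format="%d/%m/%Y")
as_dates = dates.sort_values("Date")["Date"].dt.day.tolist()
print(as_dates)  # [1, 2, 10]
```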

### Basic Plot

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.plot([1, 2, 3, 4, 5], [4, 10, 6, 2, 4])
plt.show()

In [None]:
x= [1, 2, 3, 4, 5]
y= [4, 10, 6, 2, 4]
plt.plot(x,y)
plt.show()

In [None]:
plt.bar(x,y)

In [None]:
plt.pie(y, labels=x)  # pie() needs the values to plot; each wedge is labelled with its x value

In [None]:
df

In [None]:
plt.plot(df['Date'], df['Dhaka'])

In [None]:
plt.plot(df['Date'], df['Dhaka'])
plt.plot(df['Date'], df['Gazipur'])
plt.plot(df['Date'], df['Chattogram'])

In [None]:
df['Gazipur'] = pd.to_numeric(df['Gazipur'], errors='coerce')  # In Jupyter, press Shift+Tab inside the parentheses to view the function's documentation
df

In [None]:
df.dropna(inplace=True)
df

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
df

In [None]:
plt.plot(df['Date'], df['Dhaka'])
plt.plot(df['Date'], df['Gazipur'])
plt.plot(df['Date'], df['Chattogram'])

In [None]:
df = df.sort_values('Date')
df

In [None]:
plt.plot(df['Date'], df['Dhaka'])
plt.plot(df['Date'], df['Gazipur'])
plt.plot(df['Date'], df['Chattogram'])

In [None]:
plt.plot(df['Date'], df['Dhaka'], marker='o', color='blue', linestyle='-',
         linewidth=1, markersize=4, label='AQI')  # Solid blue line with circular markers
plt.title('Time Series Analysis of Air Quality Index (AQI) of Dhaka City')  # Set the title of the plot
plt.xlabel('Date')                       # Label the x-axis
plt.ylabel('Air Quality Index (AQI)')    # Label the y-axis
plt.grid(True)                           # Add gridlines to the plot
plt.legend()                             # Show the legend
plt.show()                               # Display the plot

In [None]:
plt.plot(df['Date'], df['Dhaka'], marker='D', color='red', linestyle='--',
         linewidth=1, markersize=4, label='AQI')  # Dashed red line with diamond markers
plt.title('Time Series Analysis of Air Quality Index (AQI) of Dhaka City')  # Set the title of the plot
plt.xlabel('Date')                       # Label the x-axis
plt.ylabel('Air Quality Index (AQI)')    # Label the y-axis
plt.grid(True)                           # Add gridlines to the plot
plt.legend()                             # Show the legend
plt.show()                               # Display the plot

In [None]:
plt.plot(df['Date'], df['Dhaka'], marker='o', color='blue', linestyle='-',
         linewidth=1, markersize=4, label='AQI')  # Solid blue line with circular markers
plt.title('Time Series Analysis of Air Quality Index (AQI) of Dhaka City')  # Set the title of the plot
plt.xlabel('Date')                       # Label the x-axis
plt.ylabel('Air Quality Index (AQI)')    # Label the y-axis
plt.xticks(rotation=45)                  # Rotate x-axis labels for clarity
plt.grid(True)                           # Add gridlines to the plot
plt.legend()                             # Show the legend
plt.show()                               # Display the plot
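
Rotating the tick labels helps, but date axes can also be controlled more directly. The sketch below uses Matplotlib's object-oriented interface with `matplotlib.dates` to choose tick spacing and label format. The AQI values are hypothetical, and the two-day interval and `"%d %b"` format are arbitrary choices for illustration.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical daily AQI values, invented for illustration
demo = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "Dhaka": [180, 172, 165, 190, 210, 198, 176, 160, 155, 170],
})

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(demo["Date"], demo["Dhaka"], marker="o", linewidth=1, label="Dhaka")
ax.set_title("Daily AQI (hypothetical data)")
ax.set_xlabel("Date")
ax.set_ylabel("Air Quality Index (AQI)")

ax.xaxis.set_major_locator(mdates.DayLocator(interval=2))    # one tick every 2 days
ax.xaxis.set_major_formatter(mdates.DateFormatter("%d %b"))  # labels such as "01 Jan"
fig.autofmt_xdate(rotation=45)                               # rotate and right-align labels

ax.grid(True)
ax.legend()
```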

In [None]:
plt.bar(df['Date'], df['Dhaka'], color='blue', label='AQI')  # Bar chart of the Dhaka AQI values
plt.title('Time Series Analysis of Air Quality Index (AQI) of Dhaka City')  # Set the title of the plot
plt.xlabel('Date')                       # Label the x-axis
plt.ylabel('Air Quality Index (AQI)')    # Label the y-axis
plt.grid(True)                           # Add gridlines to the plot
plt.legend()                             # Show the legend
plt.show()                               # Display the plot

In [None]:
plt.scatter(df['Date'], df['Dhaka'], color='green', label='Dhaka', s=20)          # Dhaka AQI as green dots
plt.scatter(df['Date'], df['Chattogram'], color='red', label='Chattogram', s=20)  # Chattogram AQI as red dots

plt.title('Time Series Analysis of Air Quality Index (AQI) of Dhaka and Chattogram')  # Set the title of the plot
plt.xlabel('Date')                       # Label the x-axis
plt.xticks(rotation=90)                  # Rotate x-axis labels so the dates do not overlap
plt.ylabel('Air Quality Index (AQI)')    # Label the y-axis
plt.grid(True)                           # Add gridlines to the plot
plt.legend()                             # Show the legend
plt.show()                               # Display the plot

In [None]:
#Histogram

In [None]:
plt.hist(df['Dhaka'], bins=10)  # bins = number of bars; more bins means narrower bars

In [None]:
plt.hist(df['Dhaka'], bins=20)  # doubling the number of bins halves the width of each bar

In [None]:
plt.hist(df['Dhaka'], bins=10)
plt.title('Distribution of AQI values throughout the month')  # Set the title of the plot
plt.xlabel('AQI')
plt.ylabel('Count')   
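
`hist()` also returns the bar heights and the bin edges, which makes the effect of `bins` easy to inspect. The values below are hypothetical; the side-by-side layout is just one way to compare the two settings.

```python
import matplotlib.pyplot as plt

# Hypothetical AQI values
values = [55, 60, 62, 70, 71, 75, 80, 95, 100, 120]

fig, (ax_coarse, ax_fine) = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
counts5, edges5, _ = ax_coarse.hist(values, bins=5)
counts10, edges10, _ = ax_fine.hist(values, bins=10)
ax_coarse.set_title("bins=5")
ax_fine.set_title("bins=10")

print(edges5[1] - edges5[0])          # bar width with 5 bins
print(edges10[1] - edges10[0])        # bar width with 10 bins (half as wide)
print(counts5.sum(), counts10.sum())  # both equal the number of values
```

The bins span the range from the smallest to the largest value, so each bar's width is that range divided by the number of bins.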

In [None]:
#Seaborn

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [None]:
#access built in datasets

In [None]:
sns.get_dataset_names()

In [None]:
df = sns.load_dataset("iris")
df

In [None]:
df = sns.load_dataset("titanic")
df

In [None]:
df=pd.read_csv('Weather_data.csv')
df

In [None]:
df.info()

In [None]:
df['Date/Time']= pd.to_datetime(df['Date/Time'])
df['Date/Time']

In [None]:
df['Date/Time'].info()

In [None]:
df

In [None]:
df = df.sort_values('Date/Time')
df

In [None]:
sns.countplot(x='Weather', data=df)
plt.xticks(rotation=90) 
plt.show()

In [None]:
sns.boxplot(x='Weather',y= 'Temp_C', data=df)
plt.xticks(rotation=90) 
plt.show()

In [None]:
sns.boxplot(x='Weather',y= 'Rel Hum_%', data=df)
plt.xticks(rotation=90) 
plt.show()

In [None]:
# Compute correlation matrix
corr = df.corr(numeric_only=True)

# Create heatmap
sns.heatmap(corr, annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()
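
The heatmap colours the matrix returned by `df.corr()`. As a small sanity check of how to read it, the sketch below builds a hypothetical frame in which the relationships are known in advance; the column names and values are invented and are not taken from `Weather_data.csv`.

```python
import pandas as pd

# Hypothetical weather-like columns with known relationships
demo = pd.DataFrame({
    "temp_c": [10, 15, 20, 25, 30],
    "temp_f": [50, 59, 68, 77, 86],      # exact linear function of temp_c
    "humidity": [90, 80, 70, 60, 50],    # falls steadily as temperature rises
    "label": ["a", "b", "c", "d", "e"],  # text column, skipped by numeric_only=True
})

corr = demo.corr(numeric_only=True)
print(corr.round(2))
```

Here `temp_c` and `temp_f` correlate at 1, while `temp_c` and `humidity` correlate at -1. Real data usually falls somewhere in between, and values near 0 indicate little linear relationship.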

### ✅ Save Plot to File

In [None]:
corr = df.corr(numeric_only=True)

sns.heatmap(corr, annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Heatmap")

plt.savefig('Correlation Heatmap.png')  # Saves the graph as png file. saved to the current working directory
plt.show()

#plt.savefig('Correlation Heatmap.pdf')   # Saves a vector graphic PDF
#plt.savefig('Correlation Heatmap.svg')   # Saves a scalable SVG
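
`savefig()` accepts options that are often useful for reports, such as `dpi` for raster resolution and `bbox_inches='tight'` to trim surrounding whitespace. The sketch below writes to a temporary folder; the file names and the `dpi` value are arbitrary choices for illustration.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4, 5], [4, 10, 6, 2, 4])
ax.set_title("Example figure")

out_dir = tempfile.mkdtemp()  # temporary output folder for this example
png_path = os.path.join(out_dir, "example.png")
pdf_path = os.path.join(out_dir, "example.pdf")

fig.savefig(png_path, dpi=300, bbox_inches="tight")  # high-resolution raster image
fig.savefig(pdf_path, bbox_inches="tight")           # vector output; format comes from the extension
plt.close(fig)                                       # release the figure's memory

print(os.path.exists(png_path), os.path.exists(pdf_path))  # True True
```

Calling `savefig()` before `show()`, as in the cell above, is the safer order, because in some environments the figure is cleared once it has been shown.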
