# CSV Data Cleaning and Analytics

## Introduction

This notebook was created by [Jupyter AI](https://github.com/jupyterlab/jupyter-ai) with the following prompt:

> /generate a notebook to read a csv, do some basic data cleaning and descriptive analytics

This notebook reads a CSV file, performs basic data cleaning, and produces descriptive analytics. It starts by importing the necessary libraries, including pandas for data manipulation and analysis, and then loads the CSV file into a DataFrame. The first few rows are displayed to show the structure and contents of the data. The cleaning steps handle missing values, remove duplicate rows, and apply any transformations the data needs. The descriptive analytics section generates summary statistics and visualizations to help understand the dataset.

## Load the CSV file

In [1]:
!pip install pandas

Collecting pandas
  Downloading pandas-2.2.2-cp311-cp311-win_amd64.whl.metadata (19 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2024.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.2-cp311-cp311-win_amd64.whl (11.6 MB)

In [1]:
import pandas as pd

In [2]:
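import csv
import requests

# Download the iris dataset and save a local copy
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here if the download failed

with open('iris.csv', 'wb') as f:
    f.write(response.content)

# Print every row of the saved file as a list of strings
with open('iris.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        print(row)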

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
['5.1', '3.5', '1.4', '0.2', 'setosa']
['4.9', '3.0', '1.4', '0.2', 'setosa']
['4.7', '3.2', '1.3', '0.2', 'setosa']
['4.6', '3.1', '1.5', '0.2', 'setosa']
['5.0', '3.6', '1.4', '0.2', 'setosa']
['5.4', '3.9', '1.7', '0.4', 'setosa']
['4.6', '3.4', '1.4', '0.3', 'setosa']
['5.0', '3.4', '1.5', '0.2', 'setosa']
['4.4', '2.9', '1.4', '0.2', 'setosa']
['4.9', '3.1', '1.5', '0.1', 'setosa']
['5.4', '3.7', '1.5', '0.2', 'setosa']
['4.8', '3.4', '1.6', '0.2', 'setosa']
['4.8', '3.0', '1.4', '0.1', 'setosa']
['4.3', '3.0', '1.1', '0.1', 'setosa']
['5.8', '4.0', '1.2', '0.2', 'setosa']
['5.7', '4.4', '1.5', '0.4', 'setosa']
['5.4', '3.9', '1.3', '0.4', 'setosa']
['5.1', '3.5', '1.4', '0.3', 'setosa']
['5.7', '3.8', '1.7', '0.3', 'setosa']
['5.1', '3.8', '1.5', '0.3', 'setosa']
['5.4', '3.4', '1.7', '0.2', 'setosa']
['5.1', '3.7', '1.5', '0.4', 'setosa']
['4.6', '3.6', '1.0', '0.2', 'setosa']
['5.1', '3.3', '1.7', '0.5', 
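
Printing every row is manageable for a file this small, but for larger files it may be preferable to preview only the first few records. The following is a minimal sketch under stated assumptions: `sample_csv` and `preview` are hypothetical names, and the sample is an in-memory string built from the first rows shown above, so the sketch runs without the downloaded file. It uses `csv.DictReader` with `itertools.islice` to stop after three rows.

```python
import csv
import io
from itertools import islice

# Hypothetical in-memory sample; the rows are copied from the output above.
sample_csv = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
"""

# DictReader maps each row to the header names; islice stops after 3 rows.
reader = csv.DictReader(io.StringIO(sample_csv))
preview = list(islice(reader, 3))

for row in preview:
    print(row["species"], row["sepal_length"], row["petal_length"])
```

Note that the `csv` module returns every field as a string, which is one reason the notebook switches to pandas for the actual analysis.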

In [3]:
# Load the CSV file into a DataFrame
df = pd.read_csv('iris.csv')

In [4]:
# Display the first few rows of the DataFrame to verify the data has been loaded successfully
print(df.head())

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


## Explore the data

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('iris.csv')

In [3]:
df.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
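
Beyond `head()`, a quick look at the shape and column types can help confirm that the file was parsed as expected. The sketch below is illustrative only: it assumes a small in-memory sample (`sample_csv`, a hypothetical name) made of the five rows shown above rather than the full `iris.csv`.

```python
import io
import pandas as pd

# Hypothetical in-memory sample; the rows are the five shown by head() above.
sample_csv = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Number of rows and columns
n_rows, n_cols = sample.shape

# Columns that pandas parsed as numbers (everything except 'species')
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

print(f"{n_rows} rows x {n_cols} columns")
print(sample.dtypes)
print("Numeric columns:", numeric_cols)
```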


## Data cleaning

In [5]:
import pandas as pd

In [7]:
# Read the CSV file into a DataFrame
df = pd.read_csv('iris.csv')

In [9]:
df.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

In [10]:
# Check for missing values in the entire DataFrame
print(df.isnull().sum())

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64


In [11]:
# Drop rows with missing values
df = df.dropna()
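
Since the check above reports no missing values, `dropna()` leaves this dataset unchanged. To illustrate what the step does when gaps are present, here is a minimal sketch on a toy DataFrame. The values and the names `toy`, `dropped`, and `filled` are hypothetical and not taken from `iris.csv`. It contrasts dropping incomplete rows with filling a numeric gap using the column median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two gaps; not taken from iris.csv.
toy = pd.DataFrame({
    "petal_length": [1.4, np.nan, 1.3, 1.5],
    "species": ["setosa", "setosa", None, "setosa"],
})

# Count missing values per column, as in the cell above
missing_per_column = toy.isnull().sum()

# Option 1: drop any row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill the numeric gap with the column median,
# then drop only the rows that still lack a species label
median_length = toy["petal_length"].median()
filled = toy.assign(petal_length=toy["petal_length"].fillna(median_length))
filled = filled.dropna(subset=["species"])

print(missing_per_column)
print(len(dropped), "rows after dropna;", len(filled), "rows after fill + dropna")
```

Which option is appropriate depends on the data; filling keeps more rows but introduces estimated values.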

In [12]:
# Check for duplicate rows
duplicate_rows = df.duplicated().sum()
print("Duplicate rows:", duplicate_rows)

Duplicate rows: 1


In [13]:
# Drop duplicate rows
df = df.drop_duplicates()
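
Before dropping duplicates, it can be useful to look at the rows involved. The sketch below uses a hypothetical toy DataFrame (`toy`, with one row deliberately repeated) rather than the real dataset. It shows that `duplicated()` flags only the later copies by default, while `keep=False` flags every member of a duplicate group.

```python
import pandas as pd

# Hypothetical toy data: row 2 is an exact copy of row 0.
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.1, 4.7],
    "petal_length": [1.4, 1.4, 1.4, 1.3],
    "species": ["setosa"] * 4,
})

# Default behaviour: only the second and later copies are counted
n_duplicates = int(toy.duplicated().sum())

# keep=False marks every row that has at least one identical twin
all_copies = toy[toy.duplicated(keep=False)]

# Drop the repeated rows and renumber the index
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print("Duplicate rows:", n_duplicates)
print(all_copies)
```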

In [14]:
df.groupby('species')['petal_length'].agg(['mean', 'median'])

species     mean      median
setosa      1.462     1.5
versicolor  4.26      4.35
virginica   5.561224  5.6


In [12]:
# Display the cleaned DataFrame
print("Cleaned DataFrame:")
print(df.head(10))

Cleaned DataFrame:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
5           5.4          3.9           1.7          0.4  setosa
6           4.6          3.4           1.4          0.3  setosa
7           5.0          3.4           1.5          0.2  setosa
8           4.4          2.9           1.4          0.2  setosa
9           4.9          3.1           1.5          0.1  setosa
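
If the cleaned DataFrame needs to be reused in another notebook or session, one option is to write it back to CSV. The following is a sketch under stated assumptions: `cleaned` is a hypothetical three-row stand-in for the cleaned data, the file name is only illustrative, and a temporary directory is used so nothing is left on disk. Passing `index=False` avoids writing the row index as an extra unnamed column.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame (rows copied from the output above).
cleaned = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "sepal_width": [3.5, 3.0, 3.2],
    "petal_length": [1.4, 1.4, 1.3],
    "petal_width": [0.2, 0.2, 0.2],
    "species": ["setosa", "setosa", "setosa"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_data.csv")  # illustrative file name
    # index=False keeps the row index out of the file
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
```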


## Descriptive analytics

In [10]:
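# matplotlib and seaborn are not installed together with pandas; install them first
# (for example with `!pip install matplotlib seaborn`), otherwise this cell fails
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns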

ModuleNotFoundError: No module named 'matplotlib'

In [None]:
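# Reload iris.csv and reapply the cleaning steps so this section runs on its own
df = pd.read_csv('iris.csv').dropna().drop_duplicates()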

In [None]:
summary_stats = df.describe()
print(summary_stats)
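
By default, `describe()` summarizes only the numeric columns, so a text column such as `species` does not appear in its output. As a rough illustration, the sketch below pairs `describe()` with `value_counts()` for the categorical column. It assumes a hypothetical five-row sample (`sample`) built from the rows shown earlier, not the full dataset.

```python
import pandas as pd

# Hypothetical sample; the rows are the first five shown earlier in the notebook.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "species": ["setosa"] * 5,
})

# Numeric summary: count, mean, std, min, quartiles, max
numeric_summary = sample.describe()

# Frequency of each label in the text column
species_counts = sample["species"].value_counts()

print(numeric_summary)
print(species_counts)
```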

In [None]:
# Correlate only the numeric columns; 'species' holds strings
correlation_matrix = df.corr(numeric_only=True)
print(correlation_matrix)

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Histogram of one numeric column; 'sepal_length' is an illustrative choice
plt.figure(figsize=(10, 6))
sns.histplot(df['sepal_length'], kde=True)
plt.title('Distribution of Sepal Length')
plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Scatter plot of two numeric columns; the column pair is an illustrative choice
plt.figure(figsize=(10, 6))
sns.scatterplot(x='sepal_length', y='petal_length', data=df)
plt.title('Relationship between Sepal Length and Petal Length')
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.show()