
In [None]:
# how to create a Series and a DataFrame using Pandas
import pandas as pd

# Create a Series
data = {'a': 10, 'b': 20, 'c': 30}
series = pd.Series(data)
print(series)



a    10
b    20
c    30
dtype: int64


In [None]:
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   22     Paris


# Key Points:

Series: A one-dimensional labeled array, comparable to a single column of data with an index.
DataFrame: A two-dimensional table with labeled rows and columns, in which each column is a Series.
Both structures provide efficient and flexible ways to work with tabular data in Python.
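The relationship between the two structures can be checked directly. The following is a minimal sketch; the names `ages`, `s1`, `s2`, and `combined` are illustrative and do not appear elsewhere in the notebook.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 22]})

# Selecting one column returns a Series that shares the DataFrame's row index
ages = df['Age']
print(type(ages).__name__)           # Series
print(ages.index.equals(df.index))   # True

# Several Series with a common index can be combined into a DataFrame
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='x')
s2 = pd.Series([4.0, 5.0, 6.0], index=['a', 'b', 'c'], name='y')
combined = pd.concat([s1, s2], axis=1)
print(combined)
```

Each Series name becomes a column label in `combined`, and the shared index becomes the row labels.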

In [None]:
# drop missing values using dropna()
import pandas as pd

# Sample DataFrame with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, None, 22],
        'City': ['New York', None, 'Paris', 'London']}
df = pd.DataFrame(data)

# 1. Drop rows with any missing values
df_dropped_any = df.dropna()
print("Dropped rows with any missing values:\n", df_dropped_any)

# 2. Drop rows with all missing values
df_dropped_all = df.dropna(how='all')
print("\nDropped rows with all missing values:\n", df_dropped_all)

# 3. Drop rows where 'Age' is missing
df_dropped_age = df.dropna(subset=['Age'])
print("\nDropped rows where 'Age' is missing:\n", df_dropped_age)

# 4. Drop rows where 'City' is missing
df_dropped_city = df.dropna(subset=['City'])
print("\nDropped rows where 'City' is missing:\n", df_dropped_city)

Dropped rows with any missing values:
     Name   Age      City
0  Alice  25.0  New York
3  David  22.0    London

Dropped rows with all missing values:
       Name   Age      City
0    Alice  25.0  New York
1      Bob  30.0      None
2  Charlie   NaN     Paris
3    David  22.0    London

Dropped rows where 'Age' is missing:
     Name   Age      City
0  Alice  25.0  New York
1    Bob  30.0      None
3  David  22.0    London

Dropped rows where 'City' is missing:
       Name   Age      City
0    Alice  25.0  New York
2  Charlie   NaN     Paris
3    David  22.0    London


# Explanation:

dropna(): The DataFrame method for dropping missing values (NaN or None). It returns a new DataFrame and leaves the original unchanged.
how='any' (default): Drops rows that contain at least one missing value.
how='all': Drops rows only if all values in that row are missing. The sample data has no such row, so step 2 drops nothing.
subset: Restricts the check for missing values to the listed columns.
Together, these parameters control which rows dropna() removes from a DataFrame.
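Because the sample data contains no fully empty row, the effect of how='all' is not visible in the output above. The sketch below uses a small hypothetical DataFrame, `df_demo`, that does contain one, and also shows the `thresh` parameter, which the cell above does not use.

```python
import pandas as pd
import numpy as np

# Hypothetical data: row 1 is missing in every column, row 2 only in 'Age'
df_demo = pd.DataFrame({'Name': ['Alice', None, 'Charlie'],
                        'Age': [25, np.nan, np.nan],
                        'City': ['New York', None, 'Paris']})

# how='all' removes only the row in which every value is missing (row 1)
dropped_all = df_demo.dropna(how='all')
print(dropped_all)

# thresh=3 keeps only rows with at least 3 non-missing values (row 0)
dropped_thresh = df_demo.dropna(thresh=3)
print(dropped_thresh)
```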

#### Important Notes:

Dropping rows with missing values can lead to data loss, especially if a significant portion of your data contains missing values.
Consider other strategies for handling missing data, such as:
Imputation: Filling missing values with estimated values (e.g., mean, median, mode, or more sophisticated methods).
Data transformation: Creating new features or transforming existing features to handle missing values.
The best approach for handling missing values will depend on the specific characteristics of your data and the goals of your analysis.
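As a rough illustration of imputation, the sketch below fills the numeric column with its median and the text column with a placeholder. The variable `df_imp` and the placeholder `'Unknown'` are assumptions made for this example; whether either choice is appropriate depends on the data.

```python
import pandas as pd

df_imp = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                       'Age': [25, 30, None, 22],
                       'City': ['New York', None, 'Paris', 'London']})

# Numeric column: fill the gap with the median of the observed ages
age_median = df_imp['Age'].median()
df_imp['Age'] = df_imp['Age'].fillna(age_median)

# Text column: fill the gap with an explicit placeholder value
df_imp['City'] = df_imp['City'].fillna('Unknown')

print(df_imp)
```

Unlike dropna(), this keeps all four rows, at the cost of introducing values that were never observed.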

In [None]:
# Remove duplicates using pandas
import pandas as pd

# Sample DataFrame with duplicate rows
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
        'Age': [25, 30, 22, 25, 30]}
df = pd.DataFrame(data)

# Remove duplicate rows based on all columns
df_no_duplicates = df.drop_duplicates()
print("DataFrame with duplicates removed:\n", df_no_duplicates)

# Remove duplicate rows based on specific columns
df_no_duplicates_by_name = df.drop_duplicates(subset=['Name'])
print("\nDataFrame with duplicates removed based on 'Name':\n", df_no_duplicates_by_name)

DataFrame with duplicates removed:
       Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22

DataFrame with duplicates removed based on 'Name':
       Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
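
Before dropping duplicates, it can help to see which rows pandas considers duplicates and to choose which copy survives. A minimal sketch, assuming the same data under the illustrative name `df_dup`:

```python
import pandas as pd

df_dup = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
                       'Age': [25, 30, 22, 25, 30]})

# duplicated() marks every repeat of an earlier row as True
dup_mask = df_dup.duplicated()
print(dup_mask.tolist())          # [False, False, False, True, True]

# keep='last' retains the final occurrence instead of the first
kept_last = df_dup.drop_duplicates(keep='last')
print(kept_last)

# ignore_index=True renumbers the remaining rows from 0
reindexed = df_dup.drop_duplicates(ignore_index=True)
print(reindexed)
```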


In [None]:
# Calculating mean, median, standard deviation using numpy

'''This code demonstrates how to efficiently calculate these statistical measures using the NumPy library in Python.'''

import numpy as np

data = [10, 15, 20, 25, 30, 35, 40]

# Calculate mean
mean = np.mean(data)
print(f"Mean: {mean}")

# Calculate median
median = np.median(data)
print(f"Median: {median}")

# Calculate standard deviation
std_dev = np.std(data)
print(f"Standard Deviation: {std_dev}")

Mean: 25.0
Median: 25.0
Standard Deviation: 10.0


In [None]:
# Calculating mean, median, standard deviation using a DataFrame

import pandas as pd

# Sample DataFrame
data = {'Values': [10, 15, 20, 25, 30, 35, 40]}
df = pd.DataFrame(data)

# Calculate mean
mean_value = df['Values'].mean()
print(f"Mean: {mean_value}")

# Calculate median
median_value = df['Values'].median()
print(f"Median: {median_value}")

# Calculate standard deviation
std_dev = df['Values'].std()
print(f"Standard Deviation: {std_dev}")

Mean: 25.0
Median: 25.0
Standard Deviation: 10.801234497346433
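
The NumPy and pandas cells report different standard deviations for the same data (10.0 versus about 10.80). The difference comes from the default `ddof` (delta degrees of freedom): `np.std` divides by n (population standard deviation, ddof=0), while the pandas `std` method divides by n - 1 (sample standard deviation, ddof=1). A short sketch, with variable names chosen for this example:

```python
import numpy as np
import pandas as pd

values = [10, 15, 20, 25, 30, 35, 40]

# NumPy default: population standard deviation (ddof=0)
pop_std = np.std(values)
# NumPy with ddof=1: sample standard deviation
sample_std = np.std(values, ddof=1)

# pandas default: sample standard deviation (ddof=1)
pandas_std = pd.Series(values).std()
# pandas with ddof=0: population standard deviation
pandas_pop = pd.Series(values).std(ddof=0)

print(pop_std, pandas_pop)        # both population values
print(sample_std, pandas_std)     # both sample values
```

Passing `ddof` explicitly makes the two libraries agree.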


In [None]:
import pandas as pd

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22]}
df = pd.DataFrame(data)

In [None]:
# Example 1: Adding a Salary column that contains a missing value to the existing DataFrame
df['Salary'] = [50000, 60000, None]

print("df with missing values:\n", df)

df with missing values:
       Name  Age   Salary
0    Alice   25  50000.0
1      Bob   30  60000.0
2  Charlie   22      NaN


In [None]:
# Example 2: Filling missing values with the column mean
# Assign the Series returned by fillna(); do not combine assignment with inplace=True
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print("df with filled values:\n", df)

print("Print only salary Column")
print("Salary Column with filled values:\n", df['Salary'])


df with filled values:
       Name  Age   Salary
0    Alice   25  50000.0
1      Bob   30  60000.0
2  Charlie   22  55000.0
Print only salary Column
Salary Column with filled values:
 0    50000.0
1    60000.0
2    55000.0
Name: Salary, dtype: float64


In [16]:
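# Example 3: The same fill, passing the fill value with the value= keyword
# fillna() returns a new Series, so the result is assigned back to the column
df['Salary'] = df['Salary'].fillna(value=df['Salary'].mean())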
print("df with filled values:\n", df)



df with filled values:
       Name  Age   Salary
0    Alice   25  50000.0
1      Bob   30  60000.0
2  Charlie   22  55000.0


In [17]:
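# Example 4: Calling fillna() without assigning the result does not modify df
# The values printed below were already filled by the earlier cell
df['Salary'].fillna(df['Salary'].mean())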
print("df with filled values:\n", df)

df with filled values:
       Name  Age   Salary
0    Alice   25  50000.0
1      Bob   30  60000.0
2  Charlie   22  55000.0
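
By default, fillna() returns a new object rather than changing the one it is called on, which is why the earlier cell assigns the result back to df['Salary']. The sketch below shows this on a standalone Series; the names `salary` and `filled` are illustrative.

```python
import pandas as pd

salary = pd.Series([50000, 60000, None], name='Salary')

# fillna() returns a new Series; the original is left untouched
filled = salary.fillna(salary.mean())

print(salary.isna().sum())   # 1: the original still has its missing value
print(filled.isna().sum())   # 0: only the returned Series is filled
print(filled.tolist())       # [50000.0, 60000.0, 55000.0]
```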
