### Pandas

In [0]:
#import library
import pandas as pd


In [0]:
salaries = pd.Series([50000, 60000, 70000, 80000])
display(salaries)


In [0]:
type(salaries)

### Follow-up Q: What is the difference between a list and a Series?

✅ Answer:
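- A list is a plain ordered collection of values, with no labels and no element-wise arithmetic.
- A Series pairs each value with an index label and supports vectorized operations, missing-value handling, and metadata such as a name and a dtype.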
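
To make the difference concrete, here is a small sketch that applies the same `* 2` to a list and to a Series. The variable names are illustrative and are not used elsewhere in this notebook.

```python
import pandas as pd

salary_list = [50000, 60000, 70000]
salary_series = pd.Series(salary_list)

# Multiplying a list by an integer repeats the list; no arithmetic happens.
repeated = salary_list * 2

# Multiplying a Series applies the operation to every value (vectorized).
doubled = salary_series * 2

print(len(repeated))     # the list now has twice as many elements
print(doubled.tolist())  # every salary is doubled
```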

In [0]:
#Creating custom index in series
employee=pd.Series([50000,60000,70000],index=['Gourav','SDC','Steve'])
print(employee)

Follow-up Q: What is the use of custom index?

✅ Answer: It allows named lookups and simulates dictionary-like behavior:

In [0]:
print(employee['Gourav'])
print(employee['SDC'])
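
A few more dictionary-like operations, as a rough sketch. The label 'Alice' is made up to show what happens with a missing key.

```python
import pandas as pd

employee = pd.Series([50000, 60000, 70000], index=['Gourav', 'SDC', 'Steve'])

# `in` checks the index labels, just as it checks keys for a dict.
has_steve = 'Steve' in employee

# .get() returns a default instead of raising KeyError for a missing label.
missing = employee.get('Alice', 0)

# Unlike a dict, a list of labels selects several entries in one step.
subset = employee[['Gourav', 'Steve']]

print(has_steve, missing)
print(subset)
```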


Series with Mixed Types

In [0]:
details=pd.Series(['Gourav',50000,'Data Engineer','Pune'])
print(details)

Follow-up Q: Can a Series hold different types?

✅ Answer: Yes. A Pandas Series can hold mixed data types, but it then falls back to the most generic dtype, object, and loses the speed of typed, vectorized operations.

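
One way to see the fallback is to compare the dtype of a purely numeric Series with a mixed one. This is a minimal sketch that reuses the values from the cell above.

```python
import pandas as pd

numbers = pd.Series([50000, 60000, 70000])
mixed = pd.Series(['Gourav', 50000, 'Data Engineer', 'Pune'])

print(numbers.dtype)  # an integer dtype
print(mixed.dtype)    # object

# Inside an object Series each element keeps its own Python type.
element_types = [type(value).__name__ for value in mixed]
print(element_types)
```
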
Apply Math on Series

In [0]:
salaries = pd.Series([50000, 60000, 70000])
updated_salaries = salaries * 1.10  # 10% hike
print(updated_salaries)

Handling Missing Values in Series

In [0]:
data=pd.Series([100,None,300])
print('Original Series\n',data)

In [0]:
print("Is Null:\n", data.isnull())

In [0]:
print("Fillna:\n", data.fillna(400))

🧠 Follow-up Q: Why does Pandas support None?

✅ Answer: Missing values are common in real data. Pandas treats both None and np.nan as null and provides built-in tools such as isnull(), fillna(), and dropna() to handle them.
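
A short sketch of what happens to None inside a numeric Series, using the same values as the cells above.

```python
import pandas as pd

data = pd.Series([100, None, 300])

# In a numeric Series, None is stored as NaN, so the dtype becomes float64.
print(data.dtype)

# Count the missing entries, then drop them.
missing_count = int(data.isna().sum())
without_missing = data.dropna()

print(missing_count)
print(without_missing.tolist())
```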

## DataFrame in Pandas
A DataFrame is a 2D labeled table of rows and columns. Think of it as an in-memory Excel sheet or a SQL table.

In [0]:
data={
  'Name':['Gourav','SDC','Steve'],
  'Salary':[50000,60000,70000],
  'Designation':['Data Engineer','Data Scientist','Data Analyst'],
  'Location':['Kolkata','Kolkata','Chicago']
      }

df= pd.DataFrame(data)
display(df)


🧠 Q: Why is dictionary-to-DataFrame conversion useful?

✅ A: Many real-world APIs or config files return data as dictionaries; converting them to DataFrames helps in processing.
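
API responses often arrive as a list of dictionaries, one per record. The sketch below assumes a made-up payload called `records` and shows that pandas builds the columns from the keys and fills absent keys with NaN.

```python
import pandas as pd

# Hypothetical API-style payload: one dict per record.
records = [
    {'Name': 'Gourav', 'Salary': 50000},
    {'Name': 'SDC', 'Salary': 60000},
    {'Name': 'Steve'},  # the 'Salary' key is absent in this record
]

df_records = pd.DataFrame(records)
print(df_records)
```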

Access Columns & Rows

In [0]:
display(df)

In [0]:
display(df['Salary'])


In [0]:
#row level access
#using label
print(df.loc[0])

In [0]:
#using integer position
print(df.iloc[1])

🧠 Q: Difference between .loc[] and .iloc[]?

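✅ A:
- .loc[] is label-based (row names/index labels).
- .iloc[] is position-based (integer positions starting at 0).

With the default index (0, 1, 2, ...) the labels and positions are the same numbers, so both return the same row.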
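
The difference shows up once the index is not 0, 1, 2. The sketch below uses a made-up index of 10, 20, 30 on a cut-down version of the table above.

```python
import pandas as pd

df_idx = pd.DataFrame(
    {'Name': ['Gourav', 'SDC', 'Steve'], 'Salary': [50000, 60000, 70000]},
    index=[10, 20, 30],  # illustrative labels that differ from positions
)

by_label = df_idx.loc[10]     # the row whose label is 10
by_position = df_idx.iloc[0]  # the first row, whatever its label is

# Slices differ too: .loc includes the end label, .iloc excludes the end position.
label_slice = df_idx.loc[10:20]
position_slice = df_idx.iloc[0:1]

print(by_label['Name'], by_position['Name'])
print(len(label_slice), len(position_slice))
```

Here `df_idx.loc[0]` would raise a KeyError because no row carries the label 0.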

In [0]:
df['Bonus']=df['Salary']*0.10
display(df)

**_Real world messy data cleaning_**

In [0]:
import pandas as pd
df = pd.read_csv('sales_messy.csv')
display(df)
# df = pd.read_csv('sales_messy.csv', sep='|')  # use this form if the file is pipe-delimited

In [0]:
display(df.tail(3))

In [0]:
print("Null Counts:\n", df.isnull().sum())

In [0]:
print("Datatypes:\n",df.dtypes)

In [0]:
display(df['product'])

In [0]:

#Strip extra spaces and convert the product names to Title Case
df['product']=df['product'].str.strip().str.title()
display(df)

In [0]:

print(df['product'].unique())
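
Since sales_messy.csv is not shown here, the sketch below uses made-up product names to show why the clean-up matters: variants that differ only in spacing or case collapse into one value.

```python
import pandas as pd

# Made-up values that mimic messy product names.
products = pd.Series(['  laptop', 'Laptop ', 'LAPTOP', 'mouse'])

before = products.nunique()

# strip() removes leading/trailing spaces; title() capitalizes each word.
cleaned = products.str.strip().str.title()
after = cleaned.nunique()

print(before, after)
print(cleaned.unique())
```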

In [0]:

#Handle the amount column: inspect it before filling the missing values
print(df['amount'].describe())

In [0]:

#Fill the missing values with median
df['amount']=df['amount'].fillna(df['amount'].median())
display(df)
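
A likely reason for choosing the median over the mean is that the median is not pulled around by outliers. The sketch below uses made-up amounts with one extreme value to illustrate this.

```python
import pandas as pd

# Made-up amounts: one missing value and one outlier (5000).
amounts = pd.Series([100.0, 120.0, None, 110.0, 5000.0])

mean_value = amounts.mean()      # dragged upward by the outlier
median_value = amounts.median()  # stays close to the typical amount

filled = amounts.fillna(median_value)

print(mean_value, median_value)
print(filled.tolist())
```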
     

In [0]:
#Fix the date column – parse & clean invalid entries

# Convert the 'date' column to datetime, invalid parsing will be set as NaT
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Print the number of invalid date entries
print("\n\nInvalid Date: ", df['date'].isnull().sum())

# Fill the NaT values with forward fill (fillna(method='ffill') is deprecated in recent pandas versions)
df['date'] = df['date'].ffill()

display(df)
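
The same two steps on a small made-up Series, as a sketch of what errors='coerce' and forward fill do to each entry.

```python
import pandas as pd

# Made-up raw values: two valid dates, one invalid string, one missing entry.
raw_dates = pd.Series(['2024-01-05', 'not a date', '2024-01-07', None])

# Unparseable and missing entries both become NaT.
parsed = pd.to_datetime(raw_dates, errors='coerce')
invalid_count = int(parsed.isna().sum())

# Forward fill copies the last valid date into each gap that follows it.
filled = parsed.ffill()

print(invalid_count)
print(filled)
```

Note that the null count includes values that were already missing, not only unparseable strings, and that forward fill cannot fill a gap in the very first row because there is no earlier date to copy.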

In [0]:

#Add some useful columns for analytics, such as year and month

df['year']=df['date'].dt.year
df['month']=df['date'].dt.month
display(df)

### **The Final Step Is Always a Sanity Check**

In [0]:

df.info()  # info() prints its report directly and returns None
print(df.describe())
print(df.head())
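
Beyond eyeballing the output, the checks can be written down as code. The sketch below is one possible approach and is not part of the original notebook: `sanity_check` is a hypothetical helper, and it assumes the column names product, amount, and date used above.

```python
import pandas as pd

def sanity_check(frame: pd.DataFrame) -> list:
    """Hypothetical helper: return the problems found (empty list means OK)."""
    problems = []
    if frame['amount'].isna().any():
        problems.append('amount has missing values')
    if not pd.api.types.is_datetime64_any_dtype(frame['date']):
        problems.append('date is not a datetime column')
    elif frame['date'].isna().any():
        problems.append('date has missing values')
    if frame['product'].str.strip().ne(frame['product']).any():
        problems.append('product has leading or trailing spaces')
    return problems

# Made-up cleaned data, plus a copy with one value knocked out.
clean = pd.DataFrame({
    'product': ['Laptop', 'Mouse'],
    'amount': [100.0, 20.0],
    'date': pd.to_datetime(['2024-01-05', '2024-01-06']),
})
dirty = clean.copy()
dirty.loc[0, 'amount'] = float('nan')

problems_clean = sanity_check(clean)
problems_dirty = sanity_check(dirty)
print(problems_clean)
print(problems_dirty)
```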



In [0]:

df.to_csv("sales_cleaned_final_dummy.csv", index=False)