## Assignment 1:
### The automated_stat_analyzer Function
#### Scenario: A retail company needs a utility to quickly summarize sales data. Students must create a function that identifies the "Central Tendency" and "Dispersion" of any numerical column.
### Requirements:

* Accept a Pandas DataFrame and a column name.

* Calculate the Mean, Median, and Standard Deviation.

* Identify if the data is "Skewed" by comparing the Mean and Median.


* Bonus: If the column is categorical, return the Mode instead.

### Your Data

In [47]:
import pandas as pd
import numpy as np

# Create a synthetic Company Sales Dataset
data = {
    'Transaction_ID': range(1, 11),
    'Product_Category': ['Electronics', 'Home', 'Electronics', 'Sports', 'Home', 
                         'Electronics', 'Home', 'Sports', 'Electronics', 'Electronics'],
    'Sales_Amount': [150, 200, 155, 300, 210, 180, 205, 1000, 190, 160], # 1000 is an Outlier
    'Customer_Age': [25, 34, np.nan, 45, 23, 31, 29, np.nan, 38, 40],    # Contains Nulls (NaN)
    'Rating': [5, 4, 3, 5, 2, 4, 5, 2, 4, 3]
}

df_test = pd.DataFrame(data)

# Save to CSV for students to practice loading files
df_test.to_csv('company_sales_test.csv', index=False)
print("Test dataset created successfully!")

Test dataset created successfully!
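
Since the dataset is saved to CSV for loading practice, a round trip might look like the sketch below. The small `sample` frame is a stand-in for `df_test`, and writing to a temporary directory instead of the working directory is an assumption made to keep the example self-contained.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in for df_test: a few rows with the same kinds of columns
sample = pd.DataFrame({
    'Transaction_ID': [1, 2, 3],
    'Product_Category': ['Electronics', 'Home', 'Electronics'],
    'Customer_Age': [25, 34, np.nan],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'company_sales_test.csv')
    # index=False keeps the row index out of the file
    sample.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)

# Missing values are written as empty fields and read back as NaN
print(loaded.dtypes)
print(loaded.isnull().sum())
```

Checking `dtypes` and the null counts right after loading is a quick way to confirm the file came back as expected.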


In [48]:
df_test.head()

Unnamed: 0,Transaction_ID,Product_Category,Sales_Amount,Customer_Age,Rating
0,1,Electronics,150,25.0,5
1,2,Home,200,34.0,4
2,3,Electronics,155,,3
3,4,Sports,300,45.0,5
4,5,Home,210,23.0,2


In [49]:
df_test[1:2]

Unnamed: 0,Transaction_ID,Product_Category,Sales_Amount,Customer_Age,Rating
1,2,Home,200,34.0,4


In [50]:
df_test['Product_Category'].mode()[0]

'Electronics'
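
`Series.mode()` returns a Series rather than a single value, because a column can have several equally frequent values. The sketch below uses made-up data with a tie to show the difference between picking the first mode with `[0]` and calling `.sum()` on the result, which only gives a clean answer when there is exactly one mode.

```python
import pandas as pd

# Hypothetical column with a tie: 'Home' and 'Sports' both appear twice
tied = pd.Series(['Home', 'Sports', 'Home', 'Sports', 'Electronics'])

modes = tied.mode()      # Series holding every mode, in sorted order
first_mode = modes[0]    # the first mode in that order
joined = modes.sum()     # concatenates the strings of all modes

print(list(modes))
print(first_mode)
print(joined)
```

With a tie, `joined` is a run-together string that is not a real category, so indexing is the more predictable choice.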

In [51]:
import pandas as pd

def automated_stat_analyzer(df, column_name):
    """
    Company Task: Provide a summary report of a specific data variable.
    
    Instructions:
    1. Check if the column is numerical or categorical.
    2. For numerical: Calculate Mean, Median, and Standard Deviation.
    3. For categorical: Calculate the Mode.
    4. Return a dictionary with these statistical measures.
    """
    # TODO: Implement using df[column_name].mean(), .median(), .std(), or .mode(). You can use Sales_Amount for your test case.
    column = df[column_name]

    # Numerical columns (int or float): central tendency and dispersion
    if pd.api.types.is_numeric_dtype(column):
        return {'type': 'numbers',
                'mean': column.mean(),
                'median': column.median(),
                'std': column.std()}

    # Categorical columns (bonus): return the most frequent value
    return {'type': 'categorical',
            'mode': column.mode()[0]}
    

In [52]:
automated_stat_analyzer(df_test,'Sales_Amount')

{'type': 'numbers',
 'mean': np.float64(275.0),
 'median': np.float64(195.0),
 'std': np.float64(258.30645021412494)}
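
The requirement to flag skewed data by comparing the Mean and Median can be sketched as follows. `describe_skew` and its `tolerance` cutoff are hypothetical names introduced here, not part of the assignment. The rule of thumb is that a mean well above the median suggests a right (positive) skew, and a mean well below it suggests a left skew.

```python
import pandas as pd

def describe_skew(series, tolerance=0.05):
    """Hypothetical helper: label skew by comparing mean and median.

    tolerance is an assumed cutoff: a mean-median gap smaller than
    5% of the standard deviation is treated as roughly symmetric.
    """
    mean_v = series.mean()
    median_v = series.median()
    std_v = series.std()
    gap = mean_v - median_v
    # No spread (or too few values to measure it): nothing to compare against
    if pd.isna(std_v) or std_v == 0 or abs(gap) <= tolerance * std_v:
        return 'approximately symmetric'
    return 'right-skewed' if gap > 0 else 'left-skewed'

# Sales_Amount values from the test dataset, including the 1000 outlier
sales = pd.Series([150, 200, 155, 300, 210, 180, 205, 1000, 190, 160])
print(describe_skew(sales))  # mean 275.0 is well above median 195.0
```

For `Sales_Amount`, the single outlier of 1000 pulls the mean up while the median barely moves, which is why the column is labeled right-skewed.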

## Assignment 2:
### The null_handling_strategy Function

#### Scenario: Incoming user data often has missing values. Students must implement a flexible strategy to handle these "Null Values" to prepare data for Machine Learning.
### Requirements:

* Check for null values in the DataFrame.

* Apply a strategy based on parameters: "drop_rows", "fill_mean", or "fill_median".

* Ensure the function only fills numerical columns when using mean or median.
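
One way to meet the last requirement is to pass `fillna` a Series of per-column means. The sketch below uses a made-up frame; it relies on `fillna` aligning the Series on column labels, so text columns that have no mean are left alone.

```python
import numpy as np
import pandas as pd

# Made-up frame with a gap in one numeric and one text column
frame = pd.DataFrame({
    'Customer_Age': [20.0, np.nan, 40.0],
    'Product_Category': ['Home', None, 'Sports'],
})

# mean(numeric_only=True) yields a Series indexed by numeric column names
numeric_means = frame.mean(numeric_only=True)

# fillna matches that Series to columns by label, so only numeric columns
# are filled; it returns a new DataFrame and leaves `frame` untouched
filled = frame.fillna(numeric_means)

print(filled)
```

Because `fillna` returns a new object by default, the result has to be assigned to a variable; calling it without assignment has no visible effect.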

In [112]:
df_test

Unnamed: 0,Transaction_ID,Product_Category,Sales_Amount,Customer_Age,Rating
0,1,Electronics,150,25.0,5
1,2,Home,200,34.0,4
2,3,Electronics,155,0.0,3
3,4,Sports,300,45.0,5
4,5,Home,210,23.0,2
5,6,Electronics,180,31.0,4
6,7,Home,205,29.0,5
7,8,Sports,1000,0.0,2
8,9,Electronics,190,38.0,4
9,10,Electronics,160,40.0,3


In [109]:
import pandas as pd
def null_handling_strategy(df, strategy="fill_mean"):
    """
    Company Task: Clean a dataset by resolving missing (NaN) values.
    """
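    # TODO: Implement using .isnull(), .dropna(), or .fillna(). You can use Customer_Age for your test case.
    # Nothing to clean if the DataFrame has no nulls
    if not df.isnull().values.any():
        return df
    # Short names ('mean', 'median', 'drop') are kept as aliases for the required ones
    if strategy in ('fill_mean', 'mean'):
        # fillna returns a new DataFrame, so the result must be assigned;
        # numeric_only=True restricts the fill to numerical columns
        df = df.fillna(df.mean(numeric_only=True))
    elif strategy in ('fill_median', 'median'):
        df = df.fillna(df.median(numeric_only=True))
    elif strategy == 'zeros':
        # Not inplace, so the caller's DataFrame is left untouched
        df = df.fillna(0)
    elif strategy in ('drop_rows', 'drop'):
        df = df.dropna()
    # Any other strategy name leaves the data unchanged
    return df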

    


In [110]:
null_handling_strategy(df_test, 'drop')  

Unnamed: 0,Transaction_ID,Product_Category,Sales_Amount,Customer_Age,Rating
0,1,Electronics,150,25.0,5
1,2,Home,200,34.0,4
2,3,Electronics,155,0.0,3
3,4,Sports,300,45.0,5
4,5,Home,210,23.0,2
5,6,Electronics,180,31.0,4
6,7,Home,205,29.0,5
7,8,Sports,1000,0.0,2
8,9,Electronics,190,38.0,4
9,10,Electronics,160,40.0,3


In [82]:
null_handling_strategy(df_test , 'drob')

Unnamed: 0,Transaction_ID,Product_Category,Sales_Amount,Customer_Age,Rating
0,1,Electronics,150,25.0,5
1,2,Home,200,34.0,4
2,3,Electronics,155,0.0,3
3,4,Sports,300,45.0,5
4,5,Home,210,23.0,2
5,6,Electronics,180,31.0,4
6,7,Home,205,29.0,5
7,8,Sports,1000,0.0,2
8,9,Electronics,190,38.0,4
9,10,Electronics,160,40.0,3


In [73]:
df_test

Unnamed: 0,Transaction_ID,Product_Category,Sales_Amount,Customer_Age,Rating
0,1,Electronics,150,25.0,5
1,2,Home,200,34.0,4
2,3,Electronics,155,,3
3,4,Sports,300,45.0,5
4,5,Home,210,23.0,2
5,6,Electronics,180,31.0,4
6,7,Home,205,29.0,5
7,8,Sports,1000,,2
8,9,Electronics,190,38.0,4
9,10,Electronics,160,40.0,3


In [72]:
df_test.dropna()

Unnamed: 0,Transaction_ID,Product_Category,Sales_Amount,Customer_Age,Rating
0,1,Electronics,150,25.0,5
1,2,Home,200,34.0,4
3,4,Sports,300,45.0,5
4,5,Home,210,23.0,2
5,6,Electronics,180,31.0,4
6,7,Home,205,29.0,5
8,9,Electronics,190,38.0,4
9,10,Electronics,160,40.0,3
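
Called with no arguments, `dropna()` removes every row that has a null in any column. Where only some columns matter, the `subset` parameter narrows the check. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up frame: row 1 lacks an age, row 2 lacks a rating
frame = pd.DataFrame({
    'Customer_Age': [25.0, np.nan, 40.0],
    'Rating': [5.0, 4.0, np.nan],
})

# Drop rows with a null in any column: only row 0 survives
all_complete = frame.dropna()

# Drop rows only when Customer_Age is null: rows 0 and 2 survive
age_known = frame.dropna(subset=['Customer_Age'])

print(all_complete.index.tolist())
print(age_known.index.tolist())
```

The original row labels are kept after dropping, which is why the index in the output above skips 2 and 7.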


In [65]:
null_handling_strategy(df_test , "mo" )

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
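
This error is raised whenever a whole DataFrame is used where Python expects a single True or False, for example in an `if` condition. Which line triggered it in the run above is not shown, so the sketch below only reproduces the general pattern on made-up data and shows an explicit reduction that avoids it.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'Customer_Age': [25.0, np.nan]})

try:
    # A DataFrame of booleans has no single truth value
    if frame.isnull():
        pass
except ValueError as err:
    message = str(err)
    print(message)

# Reduce to one boolean explicitly: is any cell null?
has_nulls = bool(frame.isnull().values.any())
print(has_nulls)
```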

In [53]:
df_test.isnull()


Unnamed: 0,Transaction_ID,Product_Category,Sales_Amount,Customer_Age,Rating
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,True,False
3,False,False,False,False,False
4,False,False,False,False,False
5,False,False,False,False,False
6,False,False,False,False,False
7,False,False,False,True,False
8,False,False,False,False,False
9,False,False,False,False,False


In [54]:
df_test.fillna(value= "t")

Unnamed: 0,Transaction_ID,Product_Category,Sales_Amount,Customer_Age,Rating
0,1,Electronics,150,25.0,5
1,2,Home,200,34.0,4
2,3,Electronics,155,t,3
3,4,Sports,300,45.0,5
4,5,Home,210,23.0,2
5,6,Electronics,180,31.0,4
6,7,Home,205,29.0,5
7,8,Sports,1000,t,2
8,9,Electronics,190,38.0,4
9,10,Electronics,160,40.0,3


In [42]:
# Count every null in the DataFrame without overwriting df_test
null_count = df_test.isnull().sum().sum()
null_count

np.int64(0)