As an enthusiastic data science student, I constantly find myself diving deep into the vast ocean of data science. During my explorations, I noticed that I often ended up writing repetitive code snippets. That's when I had a brilliant idea: "Why not create a Python library that houses these frequently used functions, making them easily accessible and saving precious time?"

For demonstration purposes, I'm going to use the "House Prices - Advanced Regression Techniques" dataset from Kaggle.

In [36]:
import warnings
warnings.filterwarnings('ignore')

In [37]:
#installing DataGuru
!pip -q install DataGuru

In [38]:
# importing the DataGuru Library
import DataGuru as DG

In [39]:
# Importing other necessary libraries
import plotly.express as px
import pandas as pd
import numpy as np
from scipy import stats

In [40]:
dataFrame = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")

In [41]:
dataFrame.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


# Missing Values

  Missing values refer to the absence of data in a dataset for certain observations or variables. They can occur due to various reasons such as data collection errors, data corruption, or intentional omission.

Here's an example code snippet to find missing values in a DataFrame:

In [42]:
dataFrame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [43]:
dataFrame.isnull().sum()

Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64

The output of `dataFrame.isnull().sum()` is truncated because the DataFrame has 81 columns, so we cannot see the missing-value count for every column. One workaround is to raise pandas' `display.max_rows` option (for example with `pd.set_option`). However, adjusting display settings for every such check quickly becomes tedious for someone deeply involved in data analysis.
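As a minimal sketch of the display-option workaround, the snippet below uses a made-up toy DataFrame in place of the Kaggle data. The names `toy`, `missing_counts`, `full_text`, and `only_missing` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle data (made-up values): 100 columns, one of them with a NaN
toy = pd.DataFrame(np.ones((3, 100)), columns=[f"col{i}" for i in range(100)])
toy.loc[0, "col42"] = np.nan

missing_counts = toy.isnull().sum()

# Temporarily lift the row limit so the Series is rendered in full
with pd.option_context("display.max_rows", None):
    full_text = repr(missing_counts)
print(full_text)

# Alternatively, keep only the columns that actually have missing values
only_missing = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(only_missing)
```

Filtering to the columns with at least one missing value is often enough on its own, since it shortens the output without touching any display settings.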

To tackle this problem more effectively, we can employ the "missingValues" function from the **DataGuru** library. This function offers a comprehensive solution for handling missing values and provides a detailed report for better data understanding.

**missingValues(data):** This function computes the missing values statistics for each column in the input data. It generates a DataFrame containing information such as the variable name, total values, total missing values, missing value rate, data type, unique values, and total unique values. The missing data DataFrame is sorted in descending order based on the total number of missing values.

In [44]:
DG.missingValues(dataFrame)

Variable,Total Value,Total Missing Value,Missing Value Rate,Data Type,Unique Value,Total Unique Value
PoolQC,1460,1453,0.9952,object,"[nan, Ex, Fa, Gd]",4
MiscFeature,1460,1406,0.963,object,"[nan, Shed, Gar2, Othr, TenC]",5
Alley,1460,1369,0.9377,object,"[nan, Grvl, Pave]",3
Fence,1460,1179,0.8075,object,"[nan, MnPrv, GdWo, GdPrv, MnWw]",5
FireplaceQu,1460,690,0.4726,object,"[nan, TA, Gd, Fa, Ex, Po]",6
LotFrontage,1460,259,0.1774,float64,"[65.0, 80.0, 68.0, 60.0, 84.0, 85.0, 75.0, nan...",111
GarageType,1460,81,0.0555,object,"[Attchd, Detchd, BuiltIn, CarPort, nan, Basmen...",7
GarageYrBlt,1460,81,0.0555,float64,"[2003.0, 1976.0, 2001.0, 1998.0, 2000.0, 1993....",98
GarageFinish,1460,81,0.0555,object,"[RFn, Unf, Fin, nan]",4
GarageQual,1460,81,0.0555,object,"[TA, Fa, Gd, nan, Ex, Po]",6
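For readers curious how a report like this could be assembled with plain pandas, here is a rough sketch. It is not DataGuru's actual implementation: `missing_summary` is a hypothetical helper, the toy data is made up, and the column names are simply copied from the output above.

```python
import numpy as np
import pandas as pd

def missing_summary(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column missing-value report built with plain pandas."""
    report = pd.DataFrame({
        "Total Value": len(data),                          # number of rows, same for every column
        "Total Missing Value": data.isnull().sum(),
        "Missing Value Rate": data.isnull().mean().round(4),
        "Data Type": data.dtypes.astype(str),
        "Total Unique Value": data.nunique(dropna=False),  # NaN counts as its own value
    })
    report.index.name = "Variable"
    # Columns with the most missing values come first
    return report.sort_values("Total Missing Value", ascending=False)

# Toy data (made-up values) standing in for the house-prices DataFrame
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, "Ex", np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
})
report = missing_summary(toy)
print(report)
```

Counting NaN as a unique value (`dropna=False`) is an assumption chosen to match the output above, where `PoolQC` lists `nan` among its four unique values.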


# Outliers

Outliers are data points that significantly deviate from the normal pattern or distribution of the dataset. They can be extreme values that are either unusually high or low compared to the majority of the data points.

To find outliers in a dataset using Python, you can use various statistical and visualization techniques. Here's a simple explanation of a commonly used method called the Z-score method:

1. Calculate the Z-score for each data point in the dataset. The Z-score measures how many standard deviations a data point is away from the mean. A Z-score of 0 means the data point is equal to the mean, positive values indicate points above the mean, and negative values indicate points below the mean.

2. Define a threshold for the Z-score, typically between 2 and 3. Data points whose absolute Z-score exceeds this threshold are considered outliers.

3. Identify and flag the data points that meet the outlier criteria.

Here's an example code snippet in Python using the scipy library to find outliers using the Z-score method:

In [45]:
# Assume you have a pandas DataFrame called 'dataFrame' containing your dataset
numeric_columns = dataFrame.select_dtypes(include=np.number).columns

# Calculate the Z-scores for numeric columns
z_scores = stats.zscore(dataFrame[numeric_columns])

# Define the threshold value for outliers
threshold = 3

# Find the indices of outliers for each numeric column
outliers = np.where(np.abs(z_scores) > threshold)

# Loop through each numeric column to retrieve the outlier values
for col, col_outliers in zip(numeric_columns, outliers[1]):
    outlier_values = dataFrame[col].iloc[col_outliers]
    print("Outliers in column", col, ":", outlier_values)

Outliers in column Id : 19
Outliers in column MSSubClass : 70
Outliers in column LotFrontage : nan
Outliers in column LotArea : 8500
Outliers in column OverallQual : 8
Outliers in column OverallCond : 8
Outliers in column YearBuilt : 2002
Outliers in column YearRemodAdd : 2002
Outliers in column MasVnrArea : 0.0
Outliers in column BsmtFinSF1 : 646
Outliers in column BsmtFinSF2 : 0
Outliers in column BsmtUnfSF : 468
Outliers in column TotalBsmtSF : 1114
Outliers in column 1stFlrSF : 1795
Outliers in column 2ndFlrSF : 0
Outliers in column LowQualFinSF : 0
Outliers in column GrLivArea : 1262
Outliers in column BsmtFullBath : 0
Outliers in column BsmtHalfBath : 0
Outliers in column FullBath : 1
Outliers in column HalfBath : 0
Outliers in column BedroomAbvGr : 3
Outliers in column KitchenAbvGr : 2
Outliers in column TotRmsAbvGrd : 6
Outliers in column Fireplaces : 2
Outliers in column GarageYrBlt : 1966.0
Outliers in column GarageCars : 2
Outliers in column GarageArea : 319
Outliers in colu

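The output above is not very useful: it prints a single value per column instead of the full set of outliers, and it gives no summary such as how many outliers each column contains. The cause is the loop, which pairs each column name with one entry of `outliers[1]` (the column positions of the flagged cells) and then uses that entry as a row position, so the printed values are not necessarily outliers at all. In addition, `stats.zscore` returns NaN for every value in a column that contains missing data, so columns such as `LotFrontage` are never actually checked. We have also used only the Z-score method, although alternatives such as the IQR (interquartile range) method exist. Writing and debugging this by hand for every dataset is not ideal for data enthusiasts who value their time.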
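For reference, a per-column Z-score check that reports every flagged row could be sketched as follows. `zscore_outliers` is a hypothetical helper written for this illustration, the toy data is made up, and dropping NaNs before computing the scores is an assumption about how missing values should be treated.

```python
import numpy as np
import pandas as pd

def zscore_outliers(data: pd.DataFrame, threshold: float = 3.0) -> dict:
    """Hypothetical helper: map each numeric column to its rows with |z| > threshold."""
    result = {}
    for col in data.select_dtypes(include=np.number).columns:
        series = data[col].dropna()      # skip NaNs so they do not turn the mean/std into NaN
        if series.empty:
            continue                     # nothing to score
        std = series.std(ddof=0)         # population std, the default used by scipy.stats.zscore
        if std == 0:
            continue                     # constant column: Z-scores are undefined
        z = (series - series.mean()) / std
        flagged = series[z.abs() > threshold]
        if not flagged.empty:
            result[col] = flagged        # keeps the original row labels
    return result

# Toy data (made-up values): one extreme LotArea among 21 rows
toy = pd.DataFrame({
    "LotArea": [100] * 20 + [10000],
    "Rooms": list(range(21)),
})
found = zscore_outliers(toy)
for col, values in found.items():
    print(f"Outliers in column {col}:")
    print(values)
```

Because each column is handled on its own, the row labels of the flagged values are preserved, which makes it easy to look up the full records afterwards with `toy.loc[...]`.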

To overcome this problem, we can utilize the "findOutliers" function from the DataGuru library. It provides a more comprehensive approach to identifying outliers and saves time for data analysts.

**findOutliers(data, method='zscore'):** This function detects outliers in numeric columns of the input data. It supports two outlier detection methods: Z-score and IQR (interquartile range). By default, the Z-score method is used. The function iterates over each numeric column and applies the specified outlier detection method. It then collects information about the column, including the mean, standard deviation, outliers, total outliers, and percentage of outliers. The resulting DataFrame is sorted in descending order based on the percentage of outliers.

In [46]:
DG.findOutliers(dataFrame)

Unnamed: 0,Column,Mean,Standard Deviation,Outliers,Total Outliers,Percentage of Outliers
14,BsmtHalfBath,0.057534,0.238671,1 1 1041 1 1029 1 1006 1 953 ...,82,5.616438
17,KitchenAbvGr,1.046575,0.220263,954 0 8 2 894 2 897 2 910 2  ...,68,4.657534
25,ScreenPorch,15.060959,55.738317,104 184 351 184 366 185 1067 18...,55,3.767123
23,EnclosedPorch,21.95411,61.098214,1202 208 520 210 1393 212 1150 21...,51,3.493151
6,BsmtFinSF2,46.549315,161.264017,414 531 493 532 440 539 842 ...,50,3.424658
0,MSSubClass,56.89726,42.286082,9 190 1266 190 1190 190 1186 19...,30,2.054795
3,OverallCond,5.575342,1.112418,375 1 88 2 250 2 378 2 398 ...,28,1.917808
22,OpenPorchSF,46.660274,66.233333,775 247 293 250 947 252 28 25...,27,1.849315
24,3SsnPorch,3.409589,29.307289,1156 96 120 130 187 140 704 14...,23,1.575342
21,WoodDeckSF,94.244521,125.295863,1044 474 166 476 828 486 848 48...,22,1.506849
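The IQR method mentioned above can also be sketched in plain pandas. This is only an illustration of the general technique, not DataGuru's code: `iqr_outliers` is a hypothetical helper, the toy values are made up, and the multiplier `k = 1.5` is the conventional choice rather than a value confirmed by the library.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # NaNs compare as False on both sides, so they are never flagged
    return series[(series < lower) | (series > upper)]

# Toy data (made-up values) with one obviously extreme entry
prices = pd.Series([100, 110, 120, 130, 140, 150, 1000])
flagged = iqr_outliers(prices)
print(flagged)
```

Unlike the Z-score, the quartiles are barely affected by the extreme values themselves, so the IQR rule tends to behave better on skewed columns.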


# Analyzing the Data

Analyzing the data is a crucial step in any data project, because it turns raw data into insights we can act on. However, creating data visualizations with Python libraries such as Pandas, Seaborn, Matplotlib, or Plotly can be tedious. To make this step quicker, we can use DataGuru's `analyzeData` function, which summarizes a numeric column by category and plots the result in a single call.

**analyzeData(data, numCol, catCol):** This function performs an analysis on the input data by grouping a numeric column (numCol) based on a categorical column (catCol). It calculates the mean, standard deviation, and percentage of the numeric column for each category. The results are displayed in a DataFrame sorted in descending order based on the mean value. Additionally, a bar plot is generated using Plotly Express, visualizing the mean, standard deviation, and percentage for each category.

In [47]:
DG.analyzeData(dataFrame,'SalePrice','SaleCondition')

Analysis for column 'SalePrice':
                        Mean  Standard Deviation  Percentage
SaleCondition                                               
Partial        272291.752000       103696.404119    8.561644
Normal         175202.219533        69713.636280   82.054795
Alloca         167377.416667        84460.527502    0.821918
Family         149600.000000        47820.002421    1.369863
Abnorml        146526.623762        82796.213395    6.917808
AdjLand        104125.000000        26135.464411    0.273973
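A table of this shape can be approximated with a pandas `groupby`. The sketch below is not the library's implementation: `group_stats` is a hypothetical helper, the toy data is made up, and "Percentage" is assumed to mean each category's share of rows, which is consistent with the values above summing to roughly 100.

```python
import pandas as pd

def group_stats(data: pd.DataFrame, num_col: str, cat_col: str) -> pd.DataFrame:
    """Hypothetical helper: mean, std and row share of num_col for each category of cat_col."""
    grouped = data.groupby(cat_col)[num_col]
    summary = pd.DataFrame({
        "Mean": grouped.mean(),
        "Standard Deviation": grouped.std(),               # sample std (ddof=1), the pandas default
        "Percentage": grouped.size() / len(data) * 100,    # assumed: share of rows per category
    })
    return summary.sort_values("Mean", ascending=False)

# Toy data (made-up values) standing in for the house-prices DataFrame
toy = pd.DataFrame({
    "SaleCondition": ["Normal", "Normal", "Normal", "Partial"],
    "SalePrice": [100000, 200000, 150000, 300000],
})
stats_table = group_stats(toy, "SalePrice", "SaleCondition")
print(stats_table)
```

A category with a single row, like `Partial` here, has an undefined sample standard deviation, so pandas reports NaN for it.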



🚀 Future Enhancements:
  DataGuru has exciting plans ahead! Enhancements include segregable features in analyzeData, model comparison capabilities, and advanced data preprocessing functionalities.

🌟 Join me on this data science journey with DataGuru. Collaboration and knowledge sharing are essential for growth. Feel free to share your thoughts, ideas, and suggestions. Let's learn together and create an amazing tool for the data science community! 🤝

If we take a closer look at the output of the DataGuru functions, we can identify some mistakes. For instance:
* In the `analyzeData` function, the plot values for standard deviation and percentage are not generated correctly.
* The naming convention for missing values is poorly implemented.

There are several other errors in DataGuru as well.

However, as a supportive community, we can elevate DataGuru to new heights in the future. You can access the DataGuru GitHub repository through this link: [DataGuru](https://github.com/gunaxprofessional/DataGuru). Feel free to make your valuable contributions to DataGuru.

This marks my initial contribution to the open-source community, and I would greatly appreciate any feedback or contributions to our DataGuru Python library.