# Project - Statistical Methods For Decision Making

### Marks: 60 points

# Problem Statement 1 - Wholesale Customers Analysis

### Business Context

A wholesale distributor operating in different regions of Portugal has information on the annual spending on several items in its stores across different regions and channels. The data consists of 440 large retailers' annual spending on 6 different varieties of products in 3 different regions (Lisbon, Oporto, Other) and across different sales channels (Hotel, Retail).

### Objective

The distributor wants to analyze the data to get a fair idea of the demand from different retailers, which will help it enhance the customer experience. Suppose you are a Data Scientist at the distributor and the Data Science team has shared some key questions that need to be answered. Perform the data analysis to answer these questions and help the company improve its business.

### Data Description



1. Buyer/Spender - IDs of customers
2. Channel - Sales channel (Hotel, Retail)
3. Region - Region of the distributor (Lisbon, Oporto, Other)
4. Fresh - spending on fresh vegetables
5. Milk - spending on milk
6. Grocery - spending on grocery
7. Frozen - spending on frozen food
8. Detergents_Paper - spending on detergents and toilet paper
9. Delicatessen - spending on instant foods





## Let us start by importing the required libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from scipy import stats as st
from scipy.stats import iqr #To calculate the IQR - Interquartile Range
import statistics as stat # To calculate the MODE
from statistics import stdev # To calculate the standard deviation
import warnings
warnings.filterwarnings("ignore")
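
The imports above are annotated with the statistics they are meant for (IQR, mode, standard deviation). As a minimal sketch of those three measures, the block below uses only Python's built-in `statistics` module on a small list of made-up values; `values` is hypothetical and not taken from the dataset. The `inclusive` quantile method is chosen here because it interpolates linearly between data points, which is an assumption about the convention you want, not a requirement of the project.

```python
import statistics as stat

# Hypothetical toy spending values (not from the wholesale dataset)
values = [2, 4, 4, 5, 7, 9, 10, 12, 15]

# Quartiles with linear interpolation between data points
q1, q2, q3 = stat.quantiles(values, n=4, method="inclusive")

iqr_value = q3 - q1              # interquartile range: spread of the middle 50%
mode_value = stat.mode(values)   # most frequent value
std_value = stat.stdev(values)   # sample standard deviation (n - 1 denominator)

print("IQR:", iqr_value)
print("Mode:", mode_value)
print("Std dev:", round(std_value, 2))
```

The IQR is less sensitive to extreme values than the standard deviation, which is worth keeping in mind when the spending columns turn out to be heavily skewed.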

## Loading the data

In [None]:
# Read the data
df = pd.read_csv('_______') ## Fill the blank to read the data

In [None]:
# Returns the first 5 rows
df.head()

|   | Buyer/Spender | Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Retail | Other | 12669 | 9656 | 7561 | 214 | 2674 | 1338 |
| 1 | 2 | Retail | Other | 7057 | 9810 | 9568 | 1762 | 3293 | 1776 |
| 2 | 3 | Retail | Other | 6353 | 8808 | 7684 | 2405 | 3516 | 7844 |
| 3 | 4 | Hotel | Other | 13265 | 1196 | 4221 | 6404 | 507 | 1788 |
| 4 | 5 | Retail | Other | 22615 | 5410 | 7198 | 3915 | 1777 | 5185 |


## Data Overview

#### How many rows and columns are present in the data?

In [None]:
# Check the shape of the dataset
df._______ ## Fill in the blank

####  What are the datatypes of the different columns in the dataset?

In [None]:
# Check the datatypes
df.'____' # Write an appropriate function to check the data type of each column

#### Are there any missing values in the data?

In [None]:
# Checking for missing values in the data
df.'______'  #Write the appropriate function to print the sum of null values for each column

#### Check the statistical summary of the data.

In [None]:
# Get the summary statistics of the numerical data
df.'_______' ## Write the appropriate function to print the statistical summary of the data (Hint - you have seen this in the case studies before)
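
The four overview questions above (size, data types, missing values, summary statistics) follow a common pandas pattern. The sketch below illustrates that pattern on a tiny made-up frame; `toy` and its values are hypothetical, and one missing value is planted deliberately so the null count is not all zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value
toy = pd.DataFrame({
    "Channel": ["Retail", "Hotel", "Retail", "Hotel"],
    "Fresh": [100, 250, np.nan, 400],
    "Milk": [80, 60, 90, 70],
})

n_rows, n_cols = toy.shape               # (rows, columns)
column_types = toy.dtypes                # data type of each column
missing_per_column = toy.isnull().sum()  # count of nulls per column
summary = toy.describe()                 # numeric columns only by default

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
print(summary)
```

Note that `describe()` skips the text column by default and that its `count` row excludes missing values, so a count below the number of rows is another hint that nulls are present.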

## Exploratory Data Analysis (EDA)

### Univariate Analysis

#### Explore all the categorical variables and provide observations on their frequency.

In [None]:
plt.figure(figsize=(8, 8))

sns.'______'(data=df, x='Region')  ## Complete the code to plot the graph
plt.xlabel('Regions')
plt.ylabel('Number of Wholesale Distributors')
plt.show()

In [None]:
plt.figure(figsize=(8, 8))

sns.'_______'(data=df, x='Channel')  ## Complete the code to plot the graph
plt.xlabel('Channel')
plt.ylabel('Number of Wholesale Distributors')
plt.show()

#### Find the distribution of spending across all categories

In [None]:
import matplotlib.pyplot as plt

cols = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicatessen']
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 10))

for i in range(len(cols)):
    ax = axes[i // 3, i % 3]
    col = cols[i]
    ax.'____'(df[col], bins=50, edgecolor='#E6E6E6', color='Purple') ## Complete the code to plot a histogram
    ax.set_xlabel(col)
    ax.set_ylabel('Count')
    ax.set_title(col + " Histogram", fontsize=15)

plt.tight_layout()
plt.show()


#### Are there any outliers in the data?

In [None]:
cols = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicatessen']
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 10))

for i in range(len(cols)):
    ax = axes[i // 3, i % 3]
    col = cols[i]
    sns.'_______'(data='__', x=col, orient="v", ax=ax) ## Complete the code to create boxplot
    ax.set_title(col + " Boxplot", fontsize=15)

plt.tight_layout()
plt.show()
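
A boxplot with default settings draws its whiskers up to 1.5 times the IQR beyond the quartiles and marks points outside that range individually. The same rule can be applied numerically, which helps when you want to count outliers rather than eyeball them. The sketch below uses a made-up series; `spend` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy spending values with one clearly extreme point
spend = pd.Series([120, 150, 160, 170, 180, 200, 210, 950], name="toy_spend")

q1 = spend.quantile(0.25)
q3 = spend.quantile(0.75)
iqr_value = q3 - q1

# Fences used by the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr_value
upper_fence = q3 + 1.5 * iqr_value

# Values outside the fences are the ones a default boxplot shows as points
outliers = spend[(spend < lower_fence) | (spend > upper_fence)]
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers.tolist())
```

Whether such points should be treated as errors or as genuinely large customers is a judgment call; for spending data, very large buyers are often real.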


### Multivariate Analysis

**We will create a new column holding each customer's total spending by adding up the 6 product varieties.**

In [None]:
## Adding row totals to the data frame
df['Total'] = df.'___'(axis = 1) ## Complete the code to add the column
df.head()
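
One point to watch when building the total: the frame also holds the `Buyer/Spender` identifier and the text columns `Channel` and `Region`, which should not contribute to a spending total. A hedged sketch of one way to handle this is to sum over an explicit list of the six spending columns. The frame below reuses the first two rows shown in the `df.head()` output; the names `toy` and `spend_cols` are introduced only for this illustration, and this is not necessarily the approach the blank above expects.

```python
import pandas as pd

# The six spending columns named in the data description
spend_cols = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen"]

# First two rows from the df.head() output shown earlier
toy = pd.DataFrame({
    "Buyer/Spender": [1, 2],
    "Channel": ["Retail", "Retail"],
    "Region": ["Other", "Other"],
    "Fresh": [12669, 7057],
    "Milk": [9656, 9810],
    "Grocery": [7561, 9568],
    "Frozen": [214, 1762],
    "Detergents_Paper": [2674, 3293],
    "Delicatessen": [1338, 1776],
})

# Row-wise sum restricted to spending columns, so the ID is not added in
toy["Total"] = toy[spend_cols].sum(axis=1)
print(toy[["Buyer/Spender", "Total"]])
```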

####  Find the total spending across all regions

In [None]:
plt.figure(figsize=(8, 8))
RegionAggregated = df.groupby("____")["Total"].sum().reset_index() ## Complete the code to create a temporary dataframe
ax=sns.barplot(x="____", y="____", data=RegionAggregated) ## Complete the code to plot a bar graph
ax.bar_label(ax.containers[0])
plt.title("Total Spending Barplot Region wise", fontsize=15)
plt.show()

#### Find the total spending of all the channels

In [None]:
plt.figure(figsize=(8, 8))
## Complete the code to create a temporary dataframe
## Complete the code to plot a bar graph
ax.bar_label(ax.containers[0])
plt.title("Total Spending Barplot Channel wise", fontsize=15)
plt.show()
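
The aggregation behind these bar plots is a group-then-sum step. The sketch below shows that step on its own, without plotting, using the five rows from the `df.head()` output with their spending totals worked out by hand; `toy` and `channel_totals` are names made up for this illustration.

```python
import pandas as pd

# Channel and hand-computed spending totals for the five rows of df.head()
toy = pd.DataFrame({
    "Channel": ["Retail", "Retail", "Retail", "Hotel", "Retail"],
    "Total": [34112, 33266, 36610, 27381, 46100],
})

# One row per group, with the group key restored as an ordinary column
channel_totals = toy.groupby("Channel")["Total"].sum().reset_index()
print(channel_totals)
```

Calling `reset_index()` turns the group key back into a regular column, which is what lets the result be passed to a plotting function by column name. Five rows say nothing about the full data, so treat the numbers here only as a check on the mechanics.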

#### Find the total spending across regions via different channels

In [None]:
plt.figure(figsize=(8, 8))
sns.'____'(x='Region', y='Total', hue='____',data=df)
plt.title("Total Spending Barplot Region wise via different Channels", fontsize=15)
plt.show()

In [None]:
#Now drop the Total column
df.drop('____',axis=1, inplace=True) ## Complete the code to drop the column

#### Find the total spending on each of the categories across different regions and channels

In [None]:
cols = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicatessen']
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 10))

for i in range(len(cols)):
    ax = axes[i // 3, i % 3]
    col = cols[i]
    sns.'_____'(data=df, x='____',y=col, hue='_____', orient="v", ax=ax)
    ax.set_title(col + " Barplot", fontsize=15)

plt.tight_layout()
plt.show()

#### Do the item varieties show similar behavior across region and channel?

Hint: There are 6 different varieties of items in the data.

**We will subset the dataset with respect to region and channel.**

In [None]:
# Channel wise data subset
Retail = df[df['Channel'] == "_____"]  ## Complete the code to create a temporary dataframe
Hotel = df[df['Channel'] == "_____"]  ## Complete the code to create a temporary dataframe

**To check the behaviour of the item varieties, we will check the statistical summary.**

In [None]:
Retail.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Buyer/Spender | 142.0 | 183.0 | 132.136132 | 1.0 | 61.25 | 166.5 | 303.75 | 438.0 |
| Fresh | 142.0 | 8904.323944 | 8987.71475 | 18.0 | 2347.75 | 5993.5 | 12229.75 | 44466.0 |
| Milk | 142.0 | 10716.5 | 9679.631351 | 928.0 | 5938.0 | 7812.0 | 12162.75 | 73498.0 |
| Grocery | 142.0 | 16322.852113 | 12267.318094 | 2743.0 | 9245.25 | 12390.0 | 20183.5 | 92780.0 |
| Frozen | 142.0 | 1652.612676 | 1812.803662 | 33.0 | 534.25 | 1081.0 | 2146.75 | 11559.0 |
| Detergents_Paper | 142.0 | 7269.507042 | 6291.089697 | 332.0 | 3683.5 | 5614.5 | 8662.5 | 40827.0 |
| Delicatessen | 142.0 | 1753.43662 | 1953.797047 | 3.0 | 566.75 | 1350.0 | 2156.0 | 16523.0 |
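
Because the item varieties have very different average spending, comparing raw standard deviations can mislead. One option, offered here as a suggestion and not as the required method, is the coefficient of variation (standard deviation divided by mean), which puts the spread of each variety on a common scale. The sketch below computes it from the Retail means and standard deviations in the summary above; `retail_stats` is a name introduced for this illustration.

```python
# (mean, std) pairs copied from the Retail summary above
retail_stats = {
    "Fresh": (8904.323944, 8987.71475),
    "Milk": (10716.5, 9679.631351),
    "Grocery": (16322.852113, 12267.318094),
    "Frozen": (1652.612676, 1812.803662),
    "Detergents_Paper": (7269.507042, 6291.089697),
    "Delicatessen": (1753.43662, 1953.797047),
}

# Coefficient of variation: spread relative to the average spend
cv = {name: std / mean for name, (mean, std) in retail_stats.items()}

# Print from least to most variable relative to the mean
for name, value in sorted(cv.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.2f}")
```

A lower value means spending on that variety is more consistent across Retail customers relative to its mean. Repeating the calculation for the Hotel and region subsets gives a like-for-like basis for the comparison this section asks about.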


In [None]:
Hotel.'______' ## Complete the code to perform descriptive analysis

In [None]:
# Region wise data subset
Lisbon = df[df['Region'] == "______"]  ## Complete the code to create a temporary dataframe
Oporto = df[df['Region'] == "______"]  ## Complete the code to create a temporary dataframe
Other = df[df['Region'] == "______"]  ## Complete the code to create a temporary dataframe

**To check the behaviour of the varieties, we will do the descriptive analytics.**

In [None]:
Lisbon.'______'  ## Complete the code to perform descriptive analysis

In [None]:
Oporto.'_______'  ## Complete the code to perform descriptive analysis

In [None]:
Other.'_______'  ## Complete the code to perform descriptive analysis

#### Is there any correlation between the different item varieties in terms of spending?

In [None]:
sns.color_palette("tab10")
plt.figure(figsize=(15,7))
mask = np.triu(np.ones_like(df.corr(numeric_only=True), dtype=bool))  # numeric_only skips the text columns Channel and Region
sns.'______'('______',annot = True,mask=mask)
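
The `mask` built with `np.triu` hides the upper triangle of the correlation matrix, diagonal included, because a correlation matrix is symmetric and those cells repeat information shown in the lower triangle. The sketch below shows what the mask does on a small made-up frame, without plotting; `toy` and its values are hypothetical, and `numeric_only=True` is used on the assumption that the frame contains text columns, as the wholesale data does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one text column and three numeric columns
toy = pd.DataFrame({
    "Channel": ["Retail", "Hotel", "Retail", "Hotel"],
    "Grocery": [1, 2, 3, 4],
    "Detergents_Paper": [2, 4, 6, 9],
    "Frozen": [4, 3, 2, 1],
})

# Pairwise Pearson correlations between numeric columns only
corr = toy.corr(numeric_only=True)

# True on and above the diagonal: these cells would be hidden in a heatmap
mask = np.triu(np.ones_like(corr, dtype=bool))

# Blank out the masked cells to see what would remain visible
visible = corr.mask(mask)
print(visible)
```

Only the cells below the diagonal keep their values, one per distinct pair of columns.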

## Conclusions and Recommendations

#### What are your conclusions from the analysis? What recommendations would you like to share to help improve the business?

#### Conclusions:
*  

#### Recommendations:
*  

# Problem Statement 2 - Education - Post 12th Standard

### Objective

The objective of this analysis is to gain insights into the characteristics of colleges and answer key questions related to the educational landscape. By understanding the data, we aim to inform strategies for improving the quality of education and enhancing the overall college experience. The analysis will provide valuable insights and recommendations for stakeholders in the education sector.

### Data Description

*  Names: Names of various universities and colleges
*  Apps: Number of applications received
*  Accept: Number of applications accepted
*  Enroll: Number of new students enrolled
*  Top10perc: Percentage of new students from top 10% of Higher Secondary class
*  Top25perc: Percentage of new students from top 25% of Higher Secondary class
*  F.Undergrad: Number of full-time undergraduate students
*  P.Undergrad: Number of part-time undergraduate students
*  Outstate: Out-of-state tuition
*  Room.Board: Cost of Room and board
*  Books: Estimated book costs for a student
*  Personal: Estimated personal spending for a student
*  PhD: Percentage of faculty with Ph.D.s
*  Terminal: Percentage of faculty with a terminal degree
*  S.F.Ratio: Student/faculty ratio
*  perc.alumni: Percentage of alumni who donate
*  Expend: The Instructional expenditure per student
*  Grad.Rate: Graduation rate

## Let us start by importing the required libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
pd.options.display.float_format = '{:.2f}'.format

## Understanding the structure of the data

In [None]:
# Read the data
df = pd.read_csv('_______') ## Fill the blank to read the data
# Returns the first 5 rows
df.head()

#### How many rows and columns are present in the data?

In [None]:
# Check the shape of the dataset
df.'_______' ## Fill in the blank

####  What are the datatypes of the different columns in the dataset?

In [None]:
df.'____'

#### Are there any missing values in the data?

In [None]:
# Checking for missing values in the data
df.'______'  #Write the appropriate function to print the sum of null values for each column

#### Check the statistical summary of the data.

In [None]:
# Get the summary statistics of the numerical data
df.'_______'.T ## Write the appropriate function to print the statistical summary of the data (Hint - you have seen this in the case studies before)

#### Drop the column which does not exhibit any value

In [None]:
#Now drop the irrelevant column
df.drop('____',axis=1, inplace=True) ## Complete the code to drop the column

## Exploratory Data Analysis (EDA)

### Univariate Analysis

In [None]:
cont_cols = list(df.'______')
for col in cont_cols:
    print(col)
    print('Skew :',round(df[col].'____',2))
    plt.figure(figsize=(15,4))
    plt.subplot(1,2,1)
    df[col].'____'(bins=10,edgecolor='#E6E6E6', color='Maroon')  #Complete the code to create a histogram
    plt.vlines(df[col].'____'(),ymin = 0, ymax = 40,color = 'Yellow')  #Complete the code to find the mean
    plt.vlines(df[col].'_____'(),ymin = 0, ymax = 40,color = 'White')  #Complete the code to find the median
    plt.ylabel('count')
    plt.subplot(1,2,2)
    sns.boxplot(df[col],color='Cyan')
    plt.savefig('{}_PLOT.png'.format(col))
    plt.show()
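
The cell above reports a skew value and marks the mean and median on each histogram. These two views are related: when a few large values stretch the right tail, the mean is pulled above the median and the skew is positive. The sketch below illustrates this on a made-up series; `toy_col` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy column with one large value stretching the right tail
toy_col = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])

skew_value = toy_col.skew()      # sample skewness; positive means right-skewed
mean_value = toy_col.mean()
median_value = toy_col.median()

print("Skew:", round(skew_value, 2))
print("Mean:", mean_value, "Median:", median_value)
```

As a rough reading guide, a skew near zero suggests a roughly symmetric column, while a clearly positive or negative value suggests the median may describe a typical college better than the mean.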

In [None]:
data_scatter = df.copy(deep=True)
fig, axes = plt.subplots(nrows=4, ncols=5, figsize=(20, 20))

for i, col in enumerate(data_scatter.columns):
    ax = axes[i // 5, i % 5]
    sns.'_____'(data_scatter[col], bins=20, color='Blue', ax=ax, kde='___') # Complete the code to show the distribution
    ax.set_title(col, color='DarkRed')

plt.tight_layout()

In [None]:

data_scatter=df.copy(deep=True)
fig=plt.figure(figsize=(20,20))
for i in range(0,len(data_scatter.columns)):
    ax=fig.add_subplot(4,5,i+1)
    sns.'_____'(data_scatter[data_scatter.columns[i]],color= 'Cyan') ## Complete the code to build boxplot
    ax.set_title(data_scatter.columns[i],color='Black')
plt.tight_layout()

### Bivariate Analysis

In [None]:
sns.color_palette("pastel")

cont_cols = list(df.columns)
for col in range(1, len(cont_cols)):
    print(cont_cols[col], 'vs', cont_cols[col-1])
    plt.figure(figsize=(15, 4))
    plt.subplot(1, 2, 1)
    sns.'________'(x=df[cont_cols[col]], y=df[cont_cols[col-1]]) ## Complete the code to build a scatterplot
    plt.subplot(1, 2, 2)
    sns.'_______'(np.corrcoef(df[cont_cols[col]], df[cont_cols[col-1]]), annot=True,
                yticklabels=[cont_cols[col], cont_cols[col-1]], xticklabels=[cont_cols[col], cont_cols[col-1]],
                cmap='BuPu', cbar=False) ## Complete the code to build a heatmap
    plt.show()
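
In the loop above, `np.corrcoef` is called with two columns and its result is drawn as a small heatmap. For two inputs it returns a 2 x 2 matrix: ones on the diagonal (each variable with itself) and the Pearson correlation between the two variables in both off-diagonal cells. The sketch below shows this on made-up numbers; `apps` and `accept` are hypothetical values, not rows from the college data.

```python
import numpy as np

# Hypothetical toy values for two related columns
apps = np.array([1000, 1500, 2200, 3100, 4000])
accept = np.array([800, 1100, 1500, 2300, 2600])

# 2 x 2 matrix: [[r(apps, apps), r(apps, accept)], [r(accept, apps), r(accept, accept)]]
corr_matrix = np.corrcoef(apps, accept)

# The single number of interest is either off-diagonal cell
r = corr_matrix[0, 1]
print(corr_matrix)
print("Pearson r:", round(r, 3))
```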

#### Is there any correlation between the columns?

In [None]:
sns.color_palette("tab10")
plt.figure(figsize=(15,7))
mask = np.triu(np.ones_like(df.corr(), dtype=bool))
sns.'_____'(df.'____'(),annot = True,mask=mask) ## Complete the code to show the correlation using heatmap

## Conclusions and Recommendations

#### Conclusions:
*  

#### Recommendations:
*  

___