# Problem statement

A cloth manufacturing company wants to know which segments or attributes drive high sales.  
Approach: build a Random Forest with Sales as the target variable (first converted into a categorical variable) and all other variables as independent variables.  

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn import tree

In [2]:
company = pd.read_csv("Company_Data.csv")

In [3]:
company

,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban,US
0,9.50,138,73,11,276,120,Bad,42,17,Yes,Yes
1,11.22,111,48,16,260,83,Good,65,10,Yes,Yes
2,10.06,113,35,10,269,80,Medium,59,12,Yes,Yes
3,7.40,117,100,4,466,97,Medium,55,14,Yes,Yes
4,4.15,141,64,3,340,128,Bad,38,13,Yes,No
...,...,...,...,...,...,...,...,...,...,...,...
395,12.57,138,108,17,203,128,Good,33,14,Yes,Yes
396,6.14,139,23,3,37,120,Medium,55,11,No,Yes
397,7.41,162,26,12,368,159,Medium,40,18,Yes,Yes
398,5.94,100,79,7,284,95,Bad,50,12,Yes,Yes


In [4]:
company.shape

(400, 11)

In [5]:
company.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Sales        400 non-null    float64
 1   CompPrice    400 non-null    int64  
 2   Income       400 non-null    int64  
 3   Advertising  400 non-null    int64  
 4   Population   400 non-null    int64  
 5   Price        400 non-null    int64  
 6   ShelveLoc    400 non-null    object 
 7   Age          400 non-null    int64  
 8   Education    400 non-null    int64  
 9   Urban        400 non-null    object 
 10  US           400 non-null    object 
dtypes: float64(1), int64(7), object(3)
memory usage: 34.5+ KB


<b>There are no null values.</b>

In [6]:
company[company.duplicated()]

,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban,US


<b>There are no duplicate rows.</b>

### Understanding the target variable

The sales value is a continuous value, in thousands, for each location. <b>We are interested in the attributes that cause high sales.</b>

In [7]:
company['Sales'].value_counts()

7.80     4
6.67     3
8.77     3
9.32     3
5.87     3
        ..
8.89     1
13.39    1
9.14     1
5.07     1
9.50     1
Name: Sales, Length: 336, dtype: int64

### Converting the categorical variables into numerical

There are three categorical columns: ShelveLoc, Urban, and US. We will convert them to numerical values using encoding techniques: one-hot encoding for Urban and US, and label encoding for ShelveLoc.

In [8]:
company['ShelveLoc'].value_counts()

Medium    219
Bad        96
Good       85
Name: ShelveLoc, dtype: int64

In [9]:
company['Urban'].value_counts()

Yes    282
No     118
Name: Urban, dtype: int64

In [10]:
company['US'].value_counts()

Yes    258
No     142
Name: US, dtype: int64

In [11]:
company = pd.get_dummies(company, columns=['Urban'])

In [12]:
company

,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,US,Urban_No,Urban_Yes
0,9.50,138,73,11,276,120,Bad,42,17,Yes,0,1
1,11.22,111,48,16,260,83,Good,65,10,Yes,0,1
2,10.06,113,35,10,269,80,Medium,59,12,Yes,0,1
3,7.40,117,100,4,466,97,Medium,55,14,Yes,0,1
4,4.15,141,64,3,340,128,Bad,38,13,No,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...
395,12.57,138,108,17,203,128,Good,33,14,Yes,0,1
396,6.14,139,23,3,37,120,Medium,55,11,Yes,1,0
397,7.41,162,26,12,368,159,Medium,40,18,Yes,0,1
398,5.94,100,79,7,284,95,Bad,50,12,Yes,0,1


In [13]:
company = pd.get_dummies(company, columns=['US'])

In [14]:
label_encoder = preprocessing.LabelEncoder()

In [15]:
# LabelEncoder assigns integer codes in alphabetical order: Bad=0, Good=1, Medium=2
company['ShelveLoc'] = label_encoder.fit_transform(company['ShelveLoc'])
company.head()

,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban_No,Urban_Yes,US_No,US_Yes
0,9.5,138,73,11,276,120,0,42,17,0,1,0,1
1,11.22,111,48,16,260,83,1,65,10,0,1,0,1
2,10.06,113,35,10,269,80,2,59,12,0,1,0,1
3,7.4,117,100,4,466,97,2,55,14,0,1,0,1
4,4.15,141,64,3,340,128,0,38,13,0,1,1,0
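
The label encoding above orders the shelf locations alphabetically (Bad=0, Good=1, Medium=2), which does not follow shelf quality, and one-hot encoding a Yes/No column produces two columns that mirror each other. The sketch below shows one possible alternative on a small made-up frame: an explicit ordinal mapping plus `drop_first=True`. The frame `toy` and the mapping `shelf_order` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for the ShelveLoc and Urban columns (illustrative values).
toy = pd.DataFrame({
    "ShelveLoc": ["Bad", "Good", "Medium", "Medium", "Bad"],
    "Urban": ["Yes", "No", "Yes", "Yes", "No"],
})

# Hypothetical explicit ordering, so that the codes follow shelf quality.
shelf_order = {"Bad": 0, "Medium": 1, "Good": 2}
toy["ShelveLoc"] = toy["ShelveLoc"].map(shelf_order)

# drop_first=True keeps only Urban_Yes; Urban_No would carry the same information.
toy = pd.get_dummies(toy, columns=["Urban"], drop_first=True)
print(toy)
```

For tree-based models the choice of codes mainly affects how many splits are needed to isolate a category, so this is a tidiness improvement rather than a requirement.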


In [16]:
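# right=False makes the bins left-closed: [0, 5.5) -> Low, [5.5, 11) -> Medium, [11, 16.5) -> High
company['Sales_cat'] = pd.cut(x=company['Sales'], bins=[0, 5.5, 11, 16.5], labels=['Low', 'Medium', 'High'], right=False)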
company.head()

,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban_No,Urban_Yes,US_No,US_Yes,Sales_cat
0,9.5,138,73,11,276,120,0,42,17,0,1,0,1,Medium
1,11.22,111,48,16,260,83,1,65,10,0,1,0,1,High
2,10.06,113,35,10,269,80,2,59,12,0,1,0,1,Medium
3,7.4,117,100,4,466,97,2,55,14,0,1,0,1,Medium
4,4.15,141,64,3,340,128,0,38,13,0,1,1,0,Low


In [17]:
company['Sales_cat'].value_counts()

Medium    248
Low       103
High       49
Name: Sales_cat, dtype: int64
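
To make the bin boundaries used for `Sales_cat` concrete, here is a small sketch that applies the same `pd.cut` call to a few hand-picked values. The series `sales` is made up for illustration; only the bins, labels, and `right=False` come from the notebook.

```python
import pandas as pd

# Hand-picked values sitting on and around the bin edges (illustrative only).
sales = pd.Series([0.0, 5.49, 5.5, 10.99, 11.0, 16.27])

cats = pd.cut(sales, bins=[0, 5.5, 11, 16.5],
              labels=["Low", "Medium", "High"], right=False)
print(cats.tolist())

# With right=False the last bin is [11, 16.5), so a value of exactly 16.5
# falls outside every bin and becomes NaN.
edge = pd.cut(pd.Series([16.5]), bins=[0, 5.5, 11, 16.5],
              labels=["Low", "Medium", "High"], right=False)
print(edge.isna().tolist())
```

A value equal to a boundary (5.5 or 11) goes to the higher category, and the counts above show that the resulting classes are imbalanced, with High being the smallest.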

### Separating features and labels, and train-test split

In [18]:
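# Caution: the second drop starts again from `company`, so it overwrites the first
# and 'Sales' remains in X. Since Sales_cat is derived from Sales, this leaks the target.
# Dropping both at once would be: company.drop(['Sales', 'Sales_cat'], axis=1)
X = company.drop('Sales', axis=1)
X = company.drop('Sales_cat', axis=1)
X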

,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban_No,Urban_Yes,US_No,US_Yes
0,9.50,138,73,11,276,120,0,42,17,0,1,0,1
1,11.22,111,48,16,260,83,1,65,10,0,1,0,1
2,10.06,113,35,10,269,80,2,59,12,0,1,0,1
3,7.40,117,100,4,466,97,2,55,14,0,1,0,1
4,4.15,141,64,3,340,128,0,38,13,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,12.57,138,108,17,203,128,1,33,14,0,1,0,1
396,6.14,139,23,3,37,120,2,55,11,1,0,0,1
397,7.41,162,26,12,368,159,2,40,18,0,1,0,1
398,5.94,100,79,7,284,95,0,50,12,0,1,0,1


In [19]:
X.head()

,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban_No,Urban_Yes,US_No,US_Yes
0,9.5,138,73,11,276,120,0,42,17,0,1,0,1
1,11.22,111,48,16,260,83,1,65,10,0,1,0,1
2,10.06,113,35,10,269,80,2,59,12,0,1,0,1
3,7.4,117,100,4,466,97,2,55,14,0,1,0,1
4,4.15,141,64,3,340,128,0,38,13,0,1,1,0


In [20]:
Y = company['Sales_cat']

In [21]:
Y

0      Medium
1        High
2      Medium
3      Medium
4         Low
        ...  
395      High
396    Medium
397    Medium
398    Medium
399    Medium
Name: Sales_cat, Length: 400, dtype: category
Categories (3, object): ['Low' < 'Medium' < 'High']

In [22]:
Y.isnull().sum()

0

In [24]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=30)
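
In cell 18 the second `drop` is applied to `company` again, so the `Sales` column is still present in `X` (it appears in the output above). The sketch below shows, on synthetic data, one way to remove both the raw target and its binned version in a single call and to keep the class proportions similar in both splits with `stratify`. The frame `toy` and its columns are made up for illustration; `stratify` is an addition that the original notebook does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the company data (values are random, for illustration).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Sales": rng.uniform(0, 16, size=100),
    "Price": rng.integers(50, 150, size=100),
    "Age": rng.integers(25, 80, size=100),
})
toy["Sales_cat"] = pd.cut(toy["Sales"], bins=[0, 5.5, 11, 16.5],
                          labels=["Low", "Medium", "High"], right=False)

# Drop the continuous target and its categorical version together,
# so neither can leak into the features.
X_toy = toy.drop(columns=["Sales", "Sales_cat"])
Y_toy = toy["Sales_cat"]

# stratify keeps the Low/Medium/High proportions similar in train and test.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_toy, Y_toy, test_size=0.2, random_state=30, stratify=Y_toy
)
print(X_tr.columns.tolist(), len(X_tr), len(X_te))
```

With the leak removed, accuracy on the real data would be expected to drop, because the model can no longer read the category straight from `Sales`.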

# Random Forest classifier

In [25]:
from sklearn.ensemble import RandomForestClassifier

In [26]:
forest_new = RandomForestClassifier(n_estimators=250,max_depth=10,min_samples_split=20,criterion='entropy')  # n_estimators is the number of decision trees
forest_new.fit(X_train, Y_train)

RandomForestClassifier(criterion='entropy', max_depth=10, min_samples_split=20,
                       n_estimators=250)
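
Since the stated goal is to find the attributes that drive high sales, a fitted forest's `feature_importances_` attribute is a natural thing to inspect. The sketch below does this on synthetic data in which the label depends on `Price` only; the data, the column names, and the variable `forest` are illustrative assumptions, and the same pattern would apply to `forest_new` and `X_train.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features: the label below is driven by Price, Noise is irrelevant.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "Price": rng.uniform(50, 150, size=300),
    "Noise": rng.uniform(0, 1, size=300),
})
y_toy = np.where(X_toy["Price"] < 100, "High", "Low")

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_toy, y_toy)

# Impurity-based importances, one value per feature, summing to 1.
importances = pd.Series(forest.feature_importances_,
                        index=X_toy.columns).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favour features with many distinct values, so they are best read as a rough ranking rather than an exact measure of effect.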

In [27]:
pred_rf_train = forest_new.predict(X_train) # predicting on train data set 
pd.Series(pred_rf_train).value_counts()

Medium    192
Low        85
High       43
dtype: int64

In [28]:
pd.crosstab(Y_train,pred_rf_train)

col_0,High,Low,Medium
Sales_cat,,,
Low,0,85,0
Medium,0,0,192
High,43,0,0


In [29]:
pred_rf = forest_new.predict(X_test) # predicting on test data set 
pd.Series(pred_rf).value_counts()

Medium    57
Low       18
High       5
dtype: int64

In [30]:
pd.crosstab(Y_test,pred_rf)

col_0,High,Low,Medium
Sales_cat,,,
Low,0,18,0
Medium,0,0,56
High,5,0,1
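
In the crosstab above the rows follow the category order (Low, Medium, High) while the columns are sorted alphabetically (High, Low, Medium), so correct predictions do not sit on the diagonal. As a sketch, `confusion_matrix` with an explicit `labels` list puts rows and columns in the same order. The two short lists below are made-up examples, not the notebook's predictions.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["Low", "Medium", "High"]

# Made-up true and predicted labels, for illustration only.
y_true = ["Low", "Medium", "High", "High", "Medium"]
y_pred = ["Low", "Medium", "High", "Medium", "Medium"]

# Rows are actual classes, columns are predicted classes, both in `labels` order.
cm = pd.DataFrame(confusion_matrix(y_true, y_pred, labels=labels),
                  index=labels, columns=labels)
print(cm)
```

With a shared ordering, the diagonal holds the correct predictions and every off-diagonal cell is a specific kind of misclassification.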


In [31]:
accuracy = accuracy_score(Y_test,pred_rf)
print(accuracy)

0.9875


In [32]:
accuracy_train = accuracy_score(Y_train,pred_rf_train)
print(accuracy_train)

1.0


In [33]:
print(classification_report(Y_test,pred_rf))

              precision    recall  f1-score   support

        High       1.00      0.83      0.91         6
         Low       1.00      1.00      1.00        18
      Medium       0.98      1.00      0.99        56

    accuracy                           0.99        80
   macro avg       0.99      0.94      0.97        80
weighted avg       0.99      0.99      0.99        80
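
The test set holds only 80 rows, 6 of them in the High class, so a single split gives a fairly noisy estimate. One common complement is stratified k-fold cross-validation, sketched below on synthetic data. The arrays `X_toy` and `y_toy` and the forest settings are illustrative assumptions, not the notebook's data or configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem: the label depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Five stratified folds; each fold keeps the class proportions of the full data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X_toy, y_toy, cv=cv)

print(scores.mean(), scores.std())
```

The spread of the fold scores gives a rough sense of how much a single train-test split could vary.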



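Train accuracy (1.0) and test accuracy (0.9875) are nearly equal. However, X still contains the Sales column from which Sales_cat was derived, so these scores likely reflect the model recovering the bin thresholds rather than the influence of the other attributes.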