# App Behaviour Analysis

# Project Overview

Many companies now have a mobile presence, and they often offer free products or services in their apps in the hope of moving customers to a paid membership. Examples of paid products that grew out of free ones include YouTube Red, Pandora Premium, Audible Subscription and You Need a Budget. Because marketing is never free, these companies need to know exactly whom to target with their offers and promotions.

Market: The target audience is customers who use a company's free product. For this project, this refers to users who installed (and used) the company's free mobile app.
Product: The paid memberships often provide enhanced versions of the free products, alongside new features. For example, YouTube Red allows you to leave the app while still listening to a video.
Goal: The objective of this model is to predict which users will not subscribe to the paid membership, so that greater marketing efforts can go into trying to "convert" them to paid users.

# Data Preprocessing

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("new_appdata10.csv")

In [3]:
df.head()

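| | user | dayofweek | hour | age | numscreens | minigame | used_premium_feature | enrolled | liked | location | ... | SecurityModal | ResendToken | TransactionList | NetworkFailure | ListPicker | other | savings_count | cm_count | cc_count | loan_count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 235136 | 3 | 2 | 23 | 15 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 1 |
| 1 | 333588 | 6 | 1 | 24 | 13 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 1 |
| 2 | 254414 | 1 | 19 | 23 | 3 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 234192 | 4 | 16 | 28 | 40 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 3 | 0 | 1 |
| 4 | 51549 | 1 | 18 | 31 | 32 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 10 | 0 | 2 | 0 | 1 |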


In [4]:
df.shape

(50000, 50)

In [5]:
response = df["enrolled"]

In [7]:
df = df.drop(["enrolled"],axis=1)

In [8]:
df.head()

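| | user | dayofweek | hour | age | numscreens | minigame | used_premium_feature | liked | location | Institutions | ... | SecurityModal | ResendToken | TransactionList | NetworkFailure | ListPicker | other | savings_count | cm_count | cc_count | loan_count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 235136 | 3 | 2 | 23 | 15 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 1 |
| 1 | 333588 | 6 | 1 | 24 | 13 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 1 |
| 2 | 254414 | 1 | 19 | 23 | 3 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 234192 | 4 | 16 | 28 | 40 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 3 | 0 | 1 |
| 4 | 51549 | 1 | 18 | 31 | 32 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 10 | 0 | 2 | 0 | 1 |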


In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X_train,X_test,y_train,y_test = train_test_split(df,response,test_size=0.2,random_state=0)
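
The call above holds out 20% of the 50,000 rows and fixes `random_state`, so the same rows land in the test set on every run. The sketch below illustrates that idea in plain Python. `shuffle_split` is a hypothetical helper written for this illustration, and it does not reproduce scikit-learn's exact shuffling, so the selected rows will differ from those picked by `train_test_split`.

```python
import math
import random


def shuffle_split(n_rows, test_size=0.2, seed=0):
    """Hypothetical helper: seeded shuffle split returning (train_idx, test_idx)."""
    indices = list(range(n_rows))
    # A seeded generator makes the shuffled order reproducible across runs.
    random.Random(seed).shuffle(indices)
    # Round the test share up to a whole number of rows.
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffle_split(50_000, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))

# Calling again with the same seed yields the same partition.
same_train_idx, same_test_idx = shuffle_split(50_000, test_size=0.2, seed=0)
```

The 10,000-row test set matches the total support shown in the classification report later in the notebook.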

In [12]:
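# Keep the user IDs so predictions can be matched back to users later,
# then drop the column so the ID is not used as a feature.
train_identifier = X_train["user"]
X_train = X_train.drop(["user"],axis=1)
test_identifier = X_test["user"]
X_test = X_test.drop(["user"],axis=1)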

# Feature Scaling

In [15]:
from sklearn.preprocessing import StandardScaler

In [16]:
sc = StandardScaler()

In [22]:
X_train_scaled = pd.DataFrame(sc.fit_transform(X_train), columns=X_train.columns.values, index=X_train.index.values)
X_test_scaled = pd.DataFrame(sc.transform(X_test), columns=X_test.columns.values, index=X_test.index.values)
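
The cell above fits the scaler on the training set only and reuses those statistics for the test set. As a minimal sketch of what that means for a single column, the plain-Python version below standardizes values with the training mean and population standard deviation (`StandardScaler` uses the population form, i.e. `ddof=0`). The helper names and the toy ages are assumptions for illustration, not part of the notebook's data.

```python
import statistics


def fit_scaler(column):
    """Return the mean and population standard deviation of a training column."""
    return statistics.fmean(column), statistics.pstdev(column)


def transform(column, mean, std):
    """Standardize values using statistics learned elsewhere (here: from training data)."""
    return [(x - mean) / std for x in column]


train_age = [23, 24, 28, 31]  # toy training values (hypothetical)
test_age = [23, 40]           # toy test values (hypothetical)

mean, std = fit_scaler(train_age)                 # statistics come from training data only
train_scaled = transform(train_age, mean, std)    # mean 0, standard deviation 1
test_scaled = transform(test_age, mean, std)      # reuses training statistics
print(train_scaled, test_scaled)
```

Reusing the training statistics keeps information about the test set out of preprocessing, which is why the notebook calls `fit_transform` on `X_train` but only `transform` on `X_test`.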



# Model Building

In [13]:
from sklearn.linear_model import LogisticRegression

In [29]:
Logit = LogisticRegression(random_state=0, penalty='l1', solver='saga')

In [30]:
Logit.fit(X_train_scaled,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=0, solver='saga',
          tol=0.0001, verbose=0, warm_start=False)

In [31]:
y_pred = Logit.predict(X_test_scaled)
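
For a binary logistic regression, `predict` returns class 1 when the linear score is positive, which is the same as a predicted probability above 0.5. The sketch below shows that rule for a single user in plain Python. The coefficients, intercept and feature values are hypothetical and are not taken from the fitted `Logit` model.

```python
import math


def predict_proba_one(features, coefficients, intercept):
    """Probability of class 1: the sigmoid of the linear score."""
    score = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-score))


def predict_one(features, coefficients, intercept):
    """Class label: 1 if the probability exceeds 0.5, otherwise 0."""
    return int(predict_proba_one(features, coefficients, intercept) > 0.5)


# Hypothetical model parameters for two scaled features.
coefficients = [0.8, -0.5]
intercept = 0.1

likely_enrolled = predict_one([1.0, 0.2], coefficients, intercept)     # score = 0.8
unlikely_enrolled = predict_one([-1.0, 1.0], coefficients, intercept)  # score = -1.2
print(likely_enrolled, unlikely_enrolled)
```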

# Confusion Matrix & Accuracy Score

In [34]:
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score

In [33]:
confusion_matrix(y_test,y_pred)

array([[3886, 1186],
       [1133, 3795]], dtype=int64)

In [35]:
accuracy_score(y_test,y_pred)

0.7681

# Classification Report  

In [37]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.77      0.77      0.77      5072
           1       0.76      0.77      0.77      4928

   micro avg       0.77      0.77      0.77     10000
   macro avg       0.77      0.77      0.77     10000
weighted avg       0.77      0.77      0.77     10000
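
The figures in the report can be rederived by hand from the confusion matrix shown earlier. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]` for classes 0 and 1.

```python
# Confusion matrix as printed earlier in the notebook: rows = true, columns = predicted.
cm = [[3886, 1186],
      [1133, 3795]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total

# Class 1 (enrolled): how many predicted enrollers were right, and how many real ones were found.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (not enrolled): the same ratios with the roles of the classes swapped.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(round(accuracy, 4))
print(round(precision_0, 2), round(recall_0, 2), round(f1_0, 2))
print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

Rounded to two decimals, these values agree with the per-class rows of the report, and the row sums (5072 and 4928) match the support column.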



# Cross-Validation Score

In [39]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(Logit, X_train_scaled, y_train,cv=10)

In [46]:
# Mean accuracy across the 10 folds, and twice the standard deviation as a rough spread
accuracies.mean(), accuracies.std()*2

(0.7671998879171806, 0.00942405074559825)
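
One way to read this pair is as a rough band of mean plus or minus two standard deviations, and then to check whether the held-out test accuracy falls inside it. The sketch below does that with the numbers reported in this notebook. Treating the band as a heuristic spread, rather than a formal confidence interval, is an assumption of this illustration.

```python
# Values copied from the notebook outputs.
cv_mean = 0.7671998879171806      # mean accuracy over the 10 folds
cv_two_std = 0.00942405074559825  # twice the standard deviation over the folds
test_accuracy = 0.7681            # accuracy on the held-out test set

low = cv_mean - cv_two_std
high = cv_mean + cv_two_std
test_within_band = low <= test_accuracy <= high

print(f"CV band: {low:.4f} to {high:.4f}")
print(f"Test accuracy inside band: {test_within_band}")
```

A test score inside the band suggests that the single train/test split was not unusually lucky or unlucky compared with the cross-validation folds.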

# Final Results

In [55]:
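# Join the true labels and the user IDs on their shared row index
final_results = pd.concat([y_test, test_identifier], axis=1).dropna()

# y_pred is a positional array in X_test row order, which this frame still follows
final_results['predicted'] = y_pred

# Reorder the columns and replace the original row labels with 0..n-1
final_results = final_results[['user', 'enrolled', 'predicted']].reset_index(drop=True)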

final_results

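| | user | enrolled | predicted |
| --- | --- | --- | --- |
| 0 | 239786 | 1 | 1 |
| 1 | 279644 | 1 | 1 |
| 2 | 98290 | 0 | 0 |
| 3 | 170150 | 1 | 1 |
| 4 | 237568 | 1 | 1 |
| 5 | 65042 | 1 | 0 |
| 6 | 207226 | 1 | 1 |
| 7 | 363062 | 0 | 0 |
| 8 | 152296 | 1 | 1 |
| 9 | 64484 | 0 | 0 |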
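
The assembly of `final_results` relies on two alignment rules: `pd.concat(..., axis=1)` matches rows by index label, while assigning a plain array such as `y_pred` matches rows by position. The toy example below, with made-up user IDs and labels, sketches why the predictions are attached before `reset_index` and without reordering the rows.

```python
import pandas as pd

# Toy stand-ins (hypothetical values); both Series share the shuffled test-set index.
y_test = pd.Series([1, 0, 1], index=[42, 7, 19], name="enrolled")
test_identifier = pd.Series([1001, 1002, 1003], index=[42, 7, 19], name="user")
y_pred = [1, 0, 0]  # positional predictions, in the same row order as the test set

# Index-based join: each label is paired with the user ID from the same original row.
final_results = pd.concat([y_test, test_identifier], axis=1).dropna()

# Position-based assignment: valid because the row order is still the test-set order.
final_results["predicted"] = y_pred

# Tidy up: fixed column order and a fresh 0..n-1 index.
final_results = final_results[["user", "enrolled", "predicted"]].reset_index(drop=True)
print(final_results)
```

If `dropna()` ever removed rows, the positional assignment would fail with a length mismatch, which acts as a useful guard against silently misaligned predictions.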
