### Example of a Predictive Model for Forecasting Future Purchases
* Hi, I'm Maria Carolina, and this is my first experience with predictive analysis using a Support Vector Machine (SVM).
* This project is being developed from my study of a video lesson.

### Defining the Business Problem:
* The first step is to define the objective, that is, the problem the analysis sets out to solve.
* Our task is to evaluate the attributes that influence a user's decision to buy products online and to build a predictive model that forecasts future purchases.

### Dataset:
* Online Shoppers Purchasing Intention Dataset:
https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset
* The dataset contains attribute values for 12,330 distinct online sessions. The data were compiled so that each session belongs to a different user over a one-year period, to minimize bias toward specific campaigns, special days, user profiles, or particular periods.
* The dataset has 10 quantitative and 8 qualitative attributes. The 'Revenue' attribute is the dependent variable: it serves as the class label, also called the target variable.
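
Since 'Revenue' is a boolean target, it can help to look at how its two classes are distributed before modelling. The following is only a minimal sketch on a small made-up frame (the values are hypothetical and do not come from the real file); only the column name `Revenue` is taken from the dataset description.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset; only the
# column name 'Revenue' comes from the dataset description.
toy = pd.DataFrame(
    {"Revenue": [False, False, False, True, False, True, False, False]}
)

# Absolute count of each class of the target variable.
class_counts = toy["Revenue"].value_counts()

# For a boolean column, the mean is the share of True values.
positive_share = toy["Revenue"].mean()

print(class_counts)
print(f"Share of sessions with a purchase: {positive_share:.2%}")
```

On the real data the same two calls would be applied to the loaded DataFrame. A strongly unbalanced target is one reason to look at precision, recall, and F1 in addition to accuracy.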


In [1]:
# Importing the libraries we will use
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, roc_auc_score
from sklearn import svm
import sklearn
import matplotlib
import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings('ignore', category=DeprecationWarning)

### Loading the Data:

In [2]:
df_original = pd.read_csv('online_shoppers_intention.csv')
df_original.head()

| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.2 | 0.2 | 0.0 | 0.0 | Feb | 1 | 1 | 1 | 1 | Returning_Visitor | False | False |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 64.0 | 0.0 | 0.1 | 0.0 | 0.0 | Feb | 2 | 2 | 1 | 2 | Returning_Visitor | False | False |
| 2 | 0.0 | -1.0 | 0.0 | -1.0 | 1.0 | -1.0 | 0.2 | 0.2 | 0.0 | 0.0 | Feb | 4 | 1 | 9 | 3 | Returning_Visitor | False | False |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.666667 | 0.05 | 0.14 | 0.0 | 0.0 | Feb | 3 | 2 | 2 | 4 | Returning_Visitor | False | False |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | 627.5 | 0.02 | 0.05 | 0.0 | 0.0 | Feb | 3 | 3 | 1 | 4 | Returning_Visitor | True | False |
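
Row 2 of this preview shows -1.0 in all three duration columns, which cannot be a real time spent on a page. Whether -1.0 is a sentinel for "not recorded" is an assumption; the source does not say. A small sketch for counting such entries, using only the previewed duration values as toy data:

```python
import pandas as pd

# Toy rows copied from the preview (duration columns only).
toy = pd.DataFrame({
    "Administrative_Duration": [0.0, 0.0, -1.0, 0.0, 0.0],
    "Informational_Duration": [0.0, 0.0, -1.0, 0.0, 0.0],
    "ProductRelated_Duration": [0.0, 64.0, -1.0, 2.666667, 627.5],
})

# Select the duration columns by their shared name suffix.
duration_cols = [c for c in toy.columns if c.endswith("_Duration")]

# Number of negative entries in each duration column.
negatives_per_column = (toy[duration_cols] < 0).sum()

# Rows that contain at least one negative duration.
rows_with_negative = toy[(toy[duration_cols] < 0).any(axis=1)]

print(negatives_per_column)
print(rows_with_negative.index.tolist())
```

Running the same check on the full DataFrame would show how many sessions are affected, which is useful to know before deciding how to treat them.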


### Explanation:
*


### Exploratory Analysis

In [3]:
# Shape (rows, columns)
df_original.shape 

(12330, 18)

In [4]:
# Data types
df_original.dtypes

Administrative             float64
Administrative_Duration    float64
Informational              float64
Informational_Duration     float64
ProductRelated             float64
ProductRelated_Duration    float64
BounceRates                float64
ExitRates                  float64
PageValues                 float64
SpecialDay                 float64
Month                       object
OperatingSystems             int64
Browser                      int64
Region                       int64
TrafficType                  int64
VisitorType                 object
Weekend                       bool
Revenue                       bool
dtype: object

In [5]:
# Checking for missing values
print(df_original.isna().sum())

Administrative             14
Administrative_Duration    14
Informational              14
Informational_Duration     14
ProductRelated             14
ProductRelated_Duration    14
BounceRates                14
ExitRates                  14
PageValues                  0
SpecialDay                  0
Month                       0
OperatingSystems            0
Browser                     0
Region                      0
TrafficType                 0
VisitorType                 0
Weekend                     0
Revenue                     0
dtype: int64


In [6]:
# Removing the rows with missing values
df_original.dropna(inplace = True)

In [7]:
# Checking for missing values again
print(df_original.isna().sum())

Administrative             0
Administrative_Duration    0
Informational              0
Informational_Duration     0
ProductRelated             0
ProductRelated_Duration    0
BounceRates                0
ExitRates                  0
PageValues                 0
SpecialDay                 0
Month                      0
OperatingSystems           0
Browser                    0
Region                     0
TrafficType                0
VisitorType                0
Weekend                    0
Revenue                    0
dtype: int64


In [8]:
# Checking the shape of the data after dropping the rows with missing values
df_original.shape

(12316, 18)
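
The row count went from 12330 to 12316, so `dropna` removed 14 rows. Each affected column reported exactly 14 missing values, so the gaps must all sit in the same 14 rows. The sketch below shows how this kind of consistency check could be done; the toy frame and its column names `a`, `b`, `c` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the NaNs in 'a' and 'b' sit in the same rows.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0],
    "b": [0.1, np.nan, 0.3, np.nan, 0.5],
    "c": [10, 20, 30, 40, 50],
})

# Missing values counted column by column.
missing_per_column = toy.isna().sum()

# Rows that have at least one missing value.
rows_with_any_missing = int(toy.isna().any(axis=1).sum())

# Rows that dropna() actually removes.
rows_dropped = len(toy) - len(toy.dropna())

print(missing_per_column.to_dict())
print(rows_with_any_missing, rows_dropped)
```

When the per-column count equals the number of dropped rows, as it does here, dropping costs no more rows than the worst single column would.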

In [9]:
# Checking the number of unique values per column
df_original.nunique()

Administrative               27
Administrative_Duration    3336
Informational                17
Informational_Duration     1259
ProductRelated              311
ProductRelated_Duration    9552
BounceRates                1872
ExitRates                  4777
PageValues                 2704
SpecialDay                    6
Month                        10
OperatingSystems              8
Browser                      13
Region                        9
TrafficType                  20
VisitorType                   3
Weekend                       2
Revenue                       2
dtype: int64

### Explanation:
* To make the data easier to visualize, we split the variables into continuous and categorical. Variables with 30 or fewer unique values are treated as categorical.
* Definitions:
    * Continuous variables: can take an infinite number of values within a given interval (example: age).
    * Categorical variables: represent distinct, mutually exclusive categories or groups (example: eye color).
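
The cardinality rule described above can be tried on a toy frame first. This is a sketch under stated assumptions: the helper name `split_by_cardinality`, its parameters, and the toy columns `duration` and `month` are hypothetical; only the threshold of 30 and the target name `Revenue` come from the text.

```python
import pandas as pd

def split_by_cardinality(frame, target, threshold=30):
    """Hypothetical helper: split feature columns by number of unique values."""
    # Count unique values per feature column, leaving out the target.
    n_unique = frame.drop(columns=[target]).nunique()
    continuous = n_unique[n_unique > threshold].index.tolist()
    categorical = n_unique[n_unique <= threshold].index.tolist()
    return continuous, categorical

# Toy frame: 'duration' has 40 distinct values, 'month' has 4.
toy = pd.DataFrame({
    "duration": [i * 1.5 for i in range(40)],
    "month": ["Feb", "Mar", "May", "Nov"] * 10,
    "Revenue": [False, True] * 20,
})

continuous_cols, categorical_cols = split_by_cardinality(toy, target="Revenue")
print(continuous_cols)   # ['duration']
print(categorical_cols)  # ['month']
```

A cardinality threshold is a heuristic. In the `nunique()` output above, `Administrative` (27) and `Informational` (17) are numeric counts, yet they fall below the threshold and are treated as categorical.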

In [10]:
# Preparing the data for plotting
# Create a copy of the original dataset
df = df_original.copy()

# Empty lists for the results
continuous = []
categorical = []

# Loop over the feature columns (the last column, 'Revenue', is the target)
for c in df.columns[:-1]:
    # More than 30 unique values: continuous; otherwise categorical
    if df[c].nunique() > 30:
        continuous.append(c)
    else:
        categorical.append(c)

In [11]:
continuous

['Administrative_Duration',
 'Informational_Duration',
 'ProductRelated',
 'ProductRelated_Duration',
 'BounceRates',
 'ExitRates',
 'PageValues']

In [12]:
# Continuous variables:
df[continuous].head()

| | Administrative_Duration | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.2 | 0.2 | 0.0 |
| 1 | 0.0 | 0.0 | 2.0 | 64.0 | 0.0 | 0.1 | 0.0 |
| 2 | -1.0 | -1.0 | 1.0 | -1.0 | 0.2 | 0.2 | 0.0 |
| 3 | 0.0 | 0.0 | 2.0 | 2.666667 | 0.05 | 0.14 | 0.0 |
| 4 | 0.0 | 0.0 | 10.0 | 627.5 | 0.02 | 0.05 | 0.0 |


In [13]:
# Categorical variables:
df[categorical].head()

| | Administrative | Informational | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | Feb | 1 | 1 | 1 | 1 | Returning_Visitor | False |
| 1 | 0.0 | 0.0 | 0.0 | Feb | 2 | 2 | 1 | 2 | Returning_Visitor | False |
| 2 | 0.0 | 0.0 | 0.0 | Feb | 4 | 1 | 9 | 3 | Returning_Visitor | False |
| 3 | 0.0 | 0.0 | 0.0 | Feb | 3 | 2 | 2 | 4 | Returning_Visitor | False |
| 4 | 0.0 | 0.0 | 0.0 | Feb | 3 | 3 | 1 | 4 | Returning_Visitor | True |


### Plots for Numeric Variables: