## Data Processing: One Piece

- API used: https://api-onepiece.com/en/documentation
- Goal: Clean and prepare the data so it can be used to train a Machine Learning algorithm

### 1. Importing the Libraries

In [2]:
import pandas as pd
import numpy as np
import requests

### 2. Loading the Dataset from an API

In [2]:
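# Note: the df.info() output in the next section lists 773 rows, so this request
# evidently returned the full character list despite the name filter in the URL.
response = requests.get("https://api.api-onepiece.com/v2/characters/en?name=Monkey%20D%20Luffy", timeout=30)

if response.status_code == 200:
    data = response.json()
    # Flatten nested objects (crew, fruit) into dotted columns such as crew.name
    df = pd.json_normalize(data)
else:
    print("Error:", response.status_code, response.text)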
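`pd.json_normalize` is what turns the nested `crew` and `fruit` objects of each record into dotted column names such as `crew.name` and `fruit.type`. A minimal offline sketch, using a made-up record whose shape is only assumed to resemble the API response:

```python
import pandas as pd

# Hypothetical record: the field names mirror columns used in this notebook,
# but the values are illustrative and do not come from the API.
sample = [
    {
        "id": 1,
        "name": "Example Character",
        "bounty": "1.000.000",
        "crew": {"id": 10, "name": "Example Crew"},
        "fruit": {"type": "Paramecia"},
    }
]

# Nested dictionaries are flattened into "<parent>.<child>" columns.
flat = pd.json_normalize(sample)
print(sorted(flat.columns))
```

The dotted names are plain strings, so they are selected like any other column, for example `flat["crew.name"]`.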


### 3. Data Analysis

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 773 entries, 0 to 772
Data columns (total 22 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   773 non-null    int64  
 1   name                 773 non-null    object 
 2   size                 761 non-null    object 
 3   age                  761 non-null    object 
 4   bounty               756 non-null    object 
 5   job                  772 non-null    object 
 6   status               772 non-null    object 
 7   crew.id              738 non-null    float64
 8   crew.name            738 non-null    object 
 9   crew.description     0 non-null      float64
 10  crew.status          738 non-null    object 
 11  crew.number          393 non-null    object 
 12  crew.roman_name      449 non-null    object 
 13  crew.total_prime     393 non-null    object 
 14  crew.is_yonko        738 non-null    object 
 15  fruit.id             178 non-null    flo

### 4. Data Processing

#### 4.1 Removing unnecessary columns

In [4]:
df = df.drop(columns=[
    'id', 'status',
    'crew.id', 'crew.name', 'crew.description', 'crew.status',
    'crew.number', 'crew.total_prime', 'crew.is_yonko',
    'fruit.id', 'fruit.name', 'fruit.description', 'fruit.filename', 'fruit.technicalFile',
])

#### 4.2 Standardizing missing values:

In [5]:
df["fruit.type"] = df["fruit.type"].fillna("Don't have")
df["fruit.roman_name"] = df["fruit.roman_name"].fillna("Don't have")
df["crew.roman_name"] = df["crew.roman_name"].fillna("Don't have")

#### 4.3 Converting the bounty to a numeric type:

In [6]:
df['bounty'] = df['bounty'].str.replace('.', '', regex=False)
df['bounty'] = pd.to_numeric(df['bounty'], errors='coerce')
df['bounty'] = df['bounty'].fillna(0)
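The three steps above can be traced on a handful of values. This is a small sketch with made-up inputs, assuming bounties arrive as strings with dots as thousands separators (which is what the `str.replace('.', '')` call implies):

```python
import pandas as pd

# Illustrative inputs: two formatted bounties, an empty string, a missing value
# and a non-numeric label.
bounty = pd.Series(["3.000.000.000", "66.000.000", "", None, "unknown"])

# regex=False makes "." a literal dot instead of the regex "any character".
cleaned = bounty.str.replace(".", "", regex=False)

# Anything that cannot be parsed becomes NaN, and NaN is then replaced by 0.
numeric = pd.to_numeric(cleaned, errors="coerce").fillna(0)
print(numeric.tolist())
```

Note that filling with 0 makes "no bounty" and "bounty unknown" indistinguishable, which may or may not be what a downstream model should see.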

#### 4.4 Converting character height to a numeric type:

In [7]:
df.replace({'size': ''}, np.nan, inplace=True)
df = df.dropna(subset=['size'])
df['size'] = df['size'].str.replace('cm', '', regex=False)
df['size'] = df['size'].str.replace(' ', '', regex=False)
df['size'] = pd.to_numeric(df['size'], errors='coerce')

#### 4.5 Converting character age to a numeric type:

In [8]:
df.replace({'age': ''}, np.nan, inplace=True)
df = df.dropna(subset=['age'])
df['age'] = df['age'].str.replace('ans', '', regex=False)
df['age'] = df['age'].str.replace(' ', '', regex=False)
df['age'] = pd.to_numeric(df['age'], errors='coerce')
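The height and age cells follow the same pattern: strip a unit suffix (`cm`, or `ans`, French for "years"), strip spaces, then convert. A sketch of how that pattern could be factored into one helper; `strip_unit` is a hypothetical name, not something the notebook defines:

```python
import pandas as pd

def strip_unit(series: pd.Series, unit: str) -> pd.Series:
    """Remove a unit suffix and spaces, then convert to numbers (NaN on failure)."""
    text = series.str.replace(unit, "", regex=False)
    text = text.str.replace(" ", "", regex=False)
    return pd.to_numeric(text, errors="coerce")

# Illustrative values only.
sizes = strip_unit(pd.Series(["174cm", "181 cm", "", None]), "cm")
ages = strip_unit(pd.Series(["19 ans", "21ans", "unknown", ""]), "ans")
print(sizes.tolist())
print(ages.tolist())
```

Unlike the notebook cells, this helper does not drop rows; unparseable entries stay as NaN so the caller can decide whether to call `dropna`.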

#### 4.6 Handling the character's occupation:

In [9]:
df.replace({'job': ''}, np.nan, inplace=True)
df["job"] = df["job"].fillna("Don't have")

#### 4.7 Removing Outliers

In [10]:
df = df.drop([760, 602, 255])
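The notebook does not say how rows 760, 602 and 255 were identified. One common way to find such rows programmatically is the interquartile-range (IQR) rule; the sketch below is an assumption about a possible approach, not the author's method, and the heights in it are made up:

```python
import pandas as pd

def iqr_outlier_index(series: pd.Series, k: float = 1.5) -> pd.Index:
    """Return index labels of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask].index

# Hypothetical heights in cm; the label 760 is reused only for illustration.
heights = pd.Series([170, 174, 176, 180, 181, 2500], index=[0, 1, 2, 3, 4, 760])
outliers = iqr_outlier_index(heights)
print(list(outliers))
```

The returned labels can be passed straight to `df.drop(...)`, which keeps the removal reproducible if the API data changes.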

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 354 entries, 0 to 753
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              354 non-null    object 
 1   size              354 non-null    float64
 2   age               354 non-null    int64  
 3   bounty            354 non-null    float64
 4   job               354 non-null    object 
 5   crew.roman_name   354 non-null    object 
 6   fruit.type        354 non-null    object 
 7   fruit.roman_name  354 non-null    object 
dtypes: float64(2), int64(1), object(5)
memory usage: 24.9+ KB


### 5. Creating the Class Labels

In [12]:
def define_survival(row):
    """Assign a rule-based survival label: 1 if the score reaches 3, otherwise 0."""
    score = 0

    # Rule 1: bounty
    if row['bounty'] > 1e8:
        score += 2
    elif row['bounty'] > 1e6:
        score += 1

    # Rule 2: devil fruit type
    if row['fruit.type'].lower() == 'logia':
        score += 3
    elif row['fruit.type'].lower() == 'zoan':
        score += 2
    elif row['fruit.type'].lower() == 'paramecia':
        score += 1

    # Rule 3: job
    if isinstance(row['job'], str):
        job = row['job'].lower()
        if 'captain' in job or 'swordsman' in job or 'right-hand' in job:
            score += 2
        elif 'sniper' in job or 'fighter' in job:
            score += 1

    # Rule 4: age
    if row['age'] < 16 or row['age'] > 70:
        score -= 1

    # Rule 5: height
    if row['size'] < 100 or row['size'] > 250:
        score -= 1

    # Final decision
    return 1 if score >= 3 else 0


In [13]:
df['survive'] = df.apply(define_survival, axis=1)
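To see how the rules combine, the raw score can be traced for a few rows. The sketch below re-implements the same rules in a compact form (the helper name `survival_score` is hypothetical) and feeds it values copied from the table printed by this notebook:

```python
def survival_score(row):
    """Return the raw score; the label is 1 when this value is >= 3."""
    score = 0
    # Bounty
    if row["bounty"] > 1e8:
        score += 2
    elif row["bounty"] > 1e6:
        score += 1
    # Devil fruit type
    score += {"logia": 3, "zoan": 2, "paramecia": 1}.get(row["fruit.type"].lower(), 0)
    # Job keywords (substring match, as in the notebook)
    job = row["job"].lower()
    if any(k in job for k in ("captain", "swordsman", "right-hand")):
        score += 2
    elif any(k in job for k in ("sniper", "fighter")):
        score += 1
    # Age and height penalties
    if row["age"] < 16 or row["age"] > 70:
        score -= 1
    if row["size"] < 100 or row["size"] > 250:
        score -= 1
    return score

rows = {
    "Monkey D Luffy": {"size": 174.0, "age": 19, "bounty": 3.0e9, "job": "Captain", "fruit.type": "Paramecia"},
    "Sanji": {"size": 180.0, "age": 21, "bounty": 3.3e8, "job": "Cook", "fruit.type": "Don't have"},
    "Bartholomew Kuma": {"size": 689.0, "age": 47, "bounty": 2.96e8, "job": "Lieutenant", "fruit.type": "Paramecia"},
}
scores = {name: survival_score(row) for name, row in rows.items()}
print(scores)
```

Luffy scores 5 (bounty 2, fruit 1, job 2), while Sanji and Kuma both score 2, which matches their `survive` values of 1, 0 and 0 in the table. Kuma loses a point for height, so the label depends as much on the penalty thresholds as on bounty.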

In [14]:
df

Unnamed: 0,name,size,age,bounty,job,crew.roman_name,fruit.type,fruit.roman_name,survive
0,Monkey D Luffy,174.0,19,3.000000e+09,Captain,Mugiwara no Ichimi,Paramecia,Gomu Gomu no Mi,1
1,Roronoa Zoro,181.0,21,3.200000e+08,Right-hand man,Mugiwara no Ichimi,Don't have,Don't have,1
2,Nami,170.0,20,6.600000e+07,Navigator,Mugiwara no Ichimi,Don't have,Don't have,0
3,Usopp,176.0,19,2.000000e+08,Sniper,Mugiwara no Ichimi,Don't have,Don't have,1
4,Sanji,180.0,21,3.300000e+08,Cook,Mugiwara no Ichimi,Don't have,Don't have,0
...,...,...,...,...,...,...,...,...,...
689,Bartholomew Kuma,689.0,47,2.960000e+08,Lieutenant,Don't have,Paramecia,Nikyu Nikyu no Mi,0
690,César Clown,309.0,40,3.000000e+08,Don't have,Don't have,Logia,Gasu Gasu no Mi,1
691,Morgans,305.0,53,0.000000e+00,World Economic Journal (boss),Don't have,Zoan,Tori Tori no Mi Moderu Arubatorosu,0
734,Zéphyr,348.0,72,0.000000e+00,Chef,Don't have,Don't have,Don't have,0


### 6. Saving the Dataset to a CSV File

In [15]:
df.to_csv("one_piece_dataset.csv")
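By default `to_csv` also writes the DataFrame index, and `read_csv` then brings it back as a column named `Unnamed: 0`. A small sketch of the round trip, using an in-memory buffer and a toy frame instead of the real file:

```python
import io
import pandas as pd

# Toy frame with a non-contiguous index, like the one left after dropping rows.
toy = pd.DataFrame({"name": ["A", "B"], "survive": [1, 0]}, index=[0, 734])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

The training part of this notebook selects features by position (`iloc[:, 1:9]`), which relies on that extra first column being present; saving with `index=False` would require shifting those positions by one.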

## Training a Machine Learning Model

In [77]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report

import joblib

### 1. Loading the Dataset

In [60]:
df = pd.read_csv("one_piece_dataset.csv")
df

Unnamed: 0.1,Unnamed: 0,name,size,age,bounty,job,crew.roman_name,fruit.type,fruit.roman_name,survive
0,0,Monkey D Luffy,174.0,19,3.000000e+09,Captain,Mugiwara no Ichimi,Paramecia,Gomu Gomu no Mi,1
1,1,Roronoa Zoro,181.0,21,3.200000e+08,Right-hand man,Mugiwara no Ichimi,Don't have,Don't have,1
2,2,Nami,170.0,20,6.600000e+07,Navigator,Mugiwara no Ichimi,Don't have,Don't have,0
3,3,Usopp,176.0,19,2.000000e+08,Sniper,Mugiwara no Ichimi,Don't have,Don't have,1
4,4,Sanji,180.0,21,3.300000e+08,Cook,Mugiwara no Ichimi,Don't have,Don't have,0
...,...,...,...,...,...,...,...,...,...,...
349,689,Bartholomew Kuma,689.0,47,2.960000e+08,Lieutenant,Don't have,Paramecia,Nikyu Nikyu no Mi,0
350,690,César Clown,309.0,40,3.000000e+08,Don't have,Don't have,Logia,Gasu Gasu no Mi,1
351,691,Morgans,305.0,53,0.000000e+00,World Economic Journal (boss),Don't have,Zoan,Tori Tori no Mi Moderu Arubatorosu,0
352,734,Zéphyr,348.0,72,0.000000e+00,Chef,Don't have,Don't have,Don't have,0


### 2. Data Analysis

In [61]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354 entries, 0 to 353
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        354 non-null    int64  
 1   name              354 non-null    object 
 2   size              354 non-null    float64
 3   age               354 non-null    int64  
 4   bounty            354 non-null    float64
 5   job               354 non-null    object 
 6   crew.roman_name   354 non-null    object 
 7   fruit.type        354 non-null    object 
 8   fruit.roman_name  354 non-null    object 
 9   survive           354 non-null    int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 27.8+ KB


In [6]:
df['survive'].value_counts()

survive
0    310
1     44
Name: count, dtype: int64
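The counts above can be turned into a few numbers that are useful when reading the later metrics. This is plain arithmetic on the printed 310 and 44:

```python
counts = {0: 310, 1: 44}  # taken from the value_counts() output
total = sum(counts.values())

# Share of the positive (minority) class.
minority_share = counts[1] / total

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline_accuracy = counts[0] / total

# Rows that oversampling must add for both classes to have 310 rows.
synthetic_needed = counts[0] - counts[1]

print(f"total={total}, minority={minority_share:.3f}, "
      f"baseline={majority_baseline_accuracy:.3f}, synthetic={synthetic_needed}")
```

Only about 12% of the rows are positive, so a model that always answers 0 would already be right roughly 88% of the time on the original data. Balancing the classes adds 266 rows, for a total of 620, which is the row count that appears after resampling.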

### 3. Separating features and target:

In [78]:
x_data = df.iloc[:, 1:9].values
y_data = df.iloc[:, 9].values

x_data, y_data

(array([['Monkey D Luffy', 174.0, 19, ..., 'Mugiwara no Ichimi',
         'Paramecia', 'Gomu Gomu no Mi'],
        ['Roronoa Zoro', 181.0, 21, ..., 'Mugiwara no Ichimi',
         "Don't have", "Don't have"],
        ['Nami', 170.0, 20, ..., 'Mugiwara no Ichimi', "Don't have",
         "Don't have"],
        ...,
        ['Morgans', 305.0, 53, ..., "Don't have", 'Zoan',
         'Tori Tori no Mi Moderu Arubatorosu'],
        ['Zéphyr', 348.0, 72, ..., "Don't have", "Don't have",
         "Don't have"],
        ['Orlombus', 480.0, 42, ..., 'Yonta Maria Dai-senda',
         "Don't have", "Don't have"]], shape=(354, 8), dtype=object),
 array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1,
        1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 

### 4. Applying Preprocessing:

In [79]:
# Infer the column positions from the Python type of each value in the first row
first_row = x_data[0]
colunas_numericas = [i for i, value in enumerate(first_row) if isinstance(value, (int, float))]
colunas_objetos = [i for i, value in enumerate(first_row) if isinstance(value, str)]

colunas_numericas, colunas_objetos

([1, 2, 3], [0, 4, 5, 6, 7])
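Inspecting only the first row works here, but it depends on that row having representative values. If the features are kept as a DataFrame instead of a NumPy array, the column dtypes can be used directly. A sketch with a toy frame (column names borrowed from this dataset, values illustrative):

```python
import pandas as pd

features = pd.DataFrame({
    "name": ["Monkey D Luffy", "Nami"],
    "size": [174.0, 170.0],
    "age": [19, 20],
    "job": ["Captain", "Navigator"],
})

# select_dtypes looks at the dtype of the whole column, not at a single row.
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, categorical_cols)
```

`ColumnTransformer` accepts column names when it is fitted on a DataFrame, so these lists could replace the positional indices in that setup.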

In [80]:
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder())
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, colunas_numericas),
        ("cat", categorical_transformer, colunas_objetos)
    ],
    remainder='passthrough'
)

x_data = preprocessor.fit_transform(x_data)
x_data

array([[-1.61516527e-01, -6.04387747e-01,  4.61025994e+00, ...,
         2.10000000e+01,  2.00000000e+00,  2.20000000e+01],
       [-1.58904417e-01, -5.53159862e-01,  2.66104000e-01, ...,
         2.10000000e+01,  0.00000000e+00,  1.50000000e+01],
       [-1.63009161e-01, -5.78773805e-01, -1.45618242e-01, ...,
         2.10000000e+01,  0.00000000e+00,  1.50000000e+01],
       ...,
       [-1.12632753e-01,  2.66486301e-01, -2.52601187e-01, ...,
         5.00000000e+00,  4.00000000e+00,  9.60000000e+01],
       [-9.65869337e-02,  7.53151210e-01, -2.52601187e-01, ...,
         5.00000000e+00,  0.00000000e+00,  1.50000000e+01],
       [-4.73300009e-02, -1.52670675e-02, -1.27000377e-02, ...,
         3.10000000e+01,  0.00000000e+00,  1.50000000e+01]],
      shape=(354, 8))
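With its default settings, `OrdinalEncoder` raises an error when `transform` meets a category that was not present during `fit`, which can happen when a new character has an unseen name, job or crew. A minimal sketch of one way to tolerate that, using scikit-learn's `handle_unknown="use_encoded_value"` option on made-up job values:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Illustrative categories, not the full set from the dataset.
train_jobs = np.array([["Captain"], ["Cook"], ["Sniper"]], dtype=object)
new_jobs = np.array([["Cook"], ["Shipwright"]], dtype=object)  # "Shipwright" is unseen

# Unknown categories are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_jobs)
encoded = encoder.transform(new_jobs).ravel()
print(encoded)
```

Ordinal codes also impose an arbitrary order on the categories, which matters for distance-based models such as KNN; whether that is acceptable here is a modelling choice the notebook does not discuss.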

### 5. Handling class imbalance:

In [81]:
over_sampling = SMOTE(sampling_strategy="minority")
x_data, y_data = over_sampling.fit_resample(x_data, y_data)

x_data, y_data 

(array([[-1.61516527e-01, -6.04387747e-01,  4.61025994e+00, ...,
          2.10000000e+01,  2.00000000e+00,  2.20000000e+01],
        [-1.58904417e-01, -5.53159862e-01,  2.66104000e-01, ...,
          2.10000000e+01,  0.00000000e+00,  1.50000000e+01],
        [-1.63009161e-01, -5.78773805e-01, -1.45618242e-01, ...,
          2.10000000e+01,  0.00000000e+00,  1.50000000e+01],
        ...,
        [-1.55677470e-01, -3.11528339e-01, -5.79018432e-02, ...,
          2.08265626e+01,  1.82656259e+00,  2.50460942e+01],
        [ 6.54619974e-04,  2.00662998e-01,  5.09590225e+00, ...,
          1.29996254e+01,  4.85735691e+00,  9.27154630e+01],
        [-1.54688103e-01, -4.09359576e-01,  5.90803643e-02, ...,
          1.39710427e+01,  0.00000000e+00,  1.50000000e+01]],
       shape=(620, 8)),
 array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1,
        1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
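The resampled rows contain fractional values such as `2.08265626e+01` in columns that held integer category codes before. That follows from how SMOTE builds synthetic samples: it interpolates between a minority sample and one of its minority-class neighbours. The sketch below shows only that interpolation step on two made-up rows; it omits the nearest-neighbour search and is not imbalanced-learn's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate(sample, neighbor, rng):
    """Create one synthetic point on the segment between two minority samples."""
    lam = rng.uniform(0.0, 1.0)
    return sample + lam * (neighbor - sample)

# Hypothetical minority rows: [scaled numeric feature, ordinal category code]
a = np.array([-0.16, 21.0])
b = np.array([-0.10, 5.0])

synthetic = interpolate(a, b, rng)
print(synthetic)
```

The second component lands somewhere between 5 and 21, a code that may not correspond to any real category. imbalanced-learn also provides `SMOTENC`, which treats declared categorical columns differently; whether it fits this dataset would need to be checked.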

### 6. Splitting the data into training and test sets:

In [82]:
X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3, random_state=42)
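Because the notebook fits the preprocessor and applies SMOTE before this split, synthetic rows derived from training rows can end up in the test set, which tends to make the reported metrics optimistic. A common alternative, shown here as an assumption rather than as what the notebook does, is to split the original 354 rows first with `stratify`, and only then fit the preprocessor and resample on the training part. The sketch covers just the stratified split, with placeholder features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with the same 310/44 class counts as the dataset.
y = np.array([0] * 310 + [1] * 44)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# stratify=y keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(np.bincount(y_tr), np.bincount(y_te))
```

After such a split, the preprocessor and the oversampler would be fitted on `X_tr, y_tr` only, and the test rows would stay untouched, real observations.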

### 7. Training and Validation

#### 7.1 Logistic Regression

In [83]:
modelLogistic = LogisticRegressionCV(solver="liblinear")
modelLogistic.fit(X_train, y_train)
y_pred = modelLogistic.predict(X_test)

In [84]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.84      0.85       104
           1       0.80      0.84      0.82        82

    accuracy                           0.84       186
   macro avg       0.84      0.84      0.84       186
weighted avg       0.84      0.84      0.84       186
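Each row of the report is derived from confusion-matrix counts. The notebook does not print the confusion matrix, so the counts below are an assumption: they were chosen to be consistent with the rounded figures for class 1 (support 82) and class 0 (support 104).

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed counts for class 1: 69 true positives, 17 false positives, 13 false negatives.
tp, fp, fn = 69, 17, 13
tn = 104 - fp  # remaining class-0 rows that were predicted correctly

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

With these counts, precision is 69/86, recall is 69/82 and accuracy is 156/186, which round to the 0.80, 0.84 and 0.84 shown in the report.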



#### 7.2 Random Forest Classifier

In [85]:
modelRFC = RandomForestClassifier()
modelRFC.fit(X_train, y_train)
y_pred = modelRFC.predict(X_test)

In [86]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.93      0.96       104
           1       0.92      0.98      0.95        82

    accuracy                           0.95       186
   macro avg       0.95      0.95      0.95       186
weighted avg       0.95      0.95      0.95       186



#### 7.3 KNN

In [87]:
modelknn = KNeighborsClassifier()
modelknn.fit(X_train, y_train)
y_pred = modelknn.predict(X_test)

In [88]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.86      0.86       104
           1       0.82      0.82      0.82        82

    accuracy                           0.84       186
   macro avg       0.84      0.84      0.84       186
weighted avg       0.84      0.84      0.84       186



### 8. Testing as in an application:

In [98]:
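# dtype=object keeps the numbers as numbers; without it NumPy would cast every value to a string
registro = np.array([["Bartholomew Kuma", 689.0, 47, 296000000.0, "Lieutenant", "Don't have", "Paramecia", "Nikyu Nikyu no Mi"]], dtype=object)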

In [93]:
registro = preprocessor.transform(registro)

In [94]:
predict = modelRFC.predict(registro)

predict

array([0])

### 9. Exporting the model:

In [95]:
pipeline_predict = Pipeline(steps=[
                                ('preprocessor', preprocessor), 
                                ('modelRFC', modelRFC)
                            ])

In [101]:
joblib.dump(pipeline_predict, "onepiece_survival_model.pkl")

['onepiece_survival_model.pkl']
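A saved pipeline is loaded back with `joblib.load` and used like the original object. The sketch below demonstrates the round trip on a toy pipeline and a temporary file, so it does not depend on `onepiece_survival_model.pkl`; the toy data and step names are assumptions for illustration:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data and pipeline standing in for the preprocessor + classifier pair.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
toy_pipeline = Pipeline(steps=[("scaler", StandardScaler()), ("model", LogisticRegression())])
toy_pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_pipeline, path)   # serialize the fitted pipeline
    restored = joblib.load(path)      # load it back, as an application would

same_predictions = np.array_equal(toy_pipeline.predict(X), restored.predict(X))
print(same_predictions)
```

Because the saved object bundles preprocessing and model, an application can pass raw records in the original column order directly to `predict`. Pickled models are generally only safe to load with a compatible scikit-learn version and from a trusted source.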