# Data Science Small Test
Prepared for Hospital Albert Einstein, for the position of 'Cientista de Dados/CRM' (Data Scientist/CRM). The repository is also available on GitHub: https://github.com/israelmendez232/teste-cientista-dados-crm

---

We start by importing the main libraries and reading the data to get a first look at it:

In [15]:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

columns = ["x", "y", "z", "label"]

dfPoints = pd.read_csv("df_points.txt", delimiter="\t", usecols=columns)

print("First 5 lines to see a fraction of the data:")
dfPoints.head()

First 5 lines to see a fraction of the data:


             x           y            z  label
0   326.488285  188.988808  -312.205307    0.0
1  -314.287214  307.276723  -179.037412    1.0
2   -328.20891  181.627758   446.311062    1.0
3   -148.65889  147.027947   -27.477959    1.0
4  -467.065931  250.467651   -306.47533    1.0
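
The list passed to `usecols` decides which columns are loaded and which are skipped. A minimal sketch, assuming the file has an unnamed leading index column (an assumption, since the raw layout of `df_points.txt` is not shown in this notebook) and reusing three of the rows printed above as inline sample data:

```python
import io

import pandas as pd

# Inline stand-in for df_points.txt. The leading unnamed index column is an
# assumption about the file layout, not something the notebook confirms.
sample_tsv = (
    "\tx\ty\tz\tlabel\n"
    "0\t326.488285\t188.988808\t-312.205307\t0.0\n"
    "1\t-314.287214\t307.276723\t-179.037412\t1.0\n"
    "2\t-328.20891\t181.627758\t446.311062\t1.0\n"
)

columns = ["x", "y", "z", "label"]

# usecols keeps only the named columns, so the unnamed index column is skipped.
df_sample = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t", usecols=columns)

print(df_sample.columns.tolist())
print(df_sample.shape)
```

Passing the same `columns` list that is defined once at the top avoids the two lists drifting apart.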


### Understanding the data better: column types and summary statistics

In [16]:
print("Data type of each column:")
print(dfPoints.dtypes)

print("\n\nSummary statistics for each column:")
print(dfPoints.describe())

Data type of each column:
x        float64
y        float64
z        float64
label    float64
dtype: object


Summary statistics for each column:
                  x             y             z         label
count  10000.000000  10000.000000  10000.000000  10000.000000
mean       0.850362     -3.108769     -2.601124      0.502700
std      288.379928    287.120263    290.379789      0.500018
min     -499.802348   -499.899134   -499.952571      0.000000
25%     -249.199895   -248.954580   -258.005693      0.000000
50%        3.663472     -5.446168     -8.221000      1.000000
75%      248.879970    244.395864    252.930406      1.000000
max      499.872453    499.752418    499.872329      1.000000
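
The `label` mean of 0.5027 in the summary implies that roughly half of the rows belong to class 1. One way to check class balance directly is `value_counts`. The sketch below uses a small hypothetical label series rather than the real data:

```python
import pandas as pd

# Hypothetical labels for illustration only; not taken from df_points.txt.
labels = pd.Series([0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0], name="label")

# Share of each class; for a 0/1 label the share of class 1 equals the mean.
class_share = labels.value_counts(normalize=True).sort_index()

print(class_share)
print(f"Mean of label: {labels.mean():.3f}")
```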


## Defining the data and testing the model
We define the target and the columns used to evaluate the model.

In [17]:
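Y = dfPoints.label
# Note: `columns` still contains "label", so the target is also passed to the
# models as a feature. This leakage most likely explains the perfect scores below.
X = dfPoints[columns]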

train_X, val_X, train_Y, val_Y = train_test_split(X, Y, test_size=0.3, random_state=1)

## Train the model
modelLR = LogisticRegression(n_jobs = 1, C = 1e5, solver = 'lbfgs')
modelLR.fit(train_X, train_Y)

predictionLR = modelLR.predict(val_X)
accuracyLR = modelLR.score(val_X, val_Y)
print(f"Accuracy in Logistic Regression: {accuracyLR * 100}%")

Accuracy in Logistic Regression: 100.0%
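
A score of exactly 100% is worth a second look. In the cell above, `X` is built from `columns`, which includes `label`, so the target is available to the model as an input. The sketch below illustrates the effect on synthetic data with random labels (an assumption, not the real dataset). A decision tree is used instead of logistic regression only because it makes the result deterministic; `validation_accuracy` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption): labels are random, so no model
# should beat roughly 50% accuracy unless the label itself leaks in.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.uniform(-500, 500, size=(2000, 3)), columns=["x", "y", "z"])
df["label"] = rng.integers(0, 2, size=len(df)).astype(float)


def validation_accuracy(feature_columns):
    """Fit a tree on the given columns and return accuracy on a 30% hold-out."""
    X = df[feature_columns]
    y = df["label"]
    train_X, val_X, train_y, val_y = train_test_split(
        X, y, test_size=0.3, random_state=1
    )
    model = DecisionTreeClassifier(random_state=0).fit(train_X, train_y)
    return model.score(val_X, val_y)


leaky = validation_accuracy(["x", "y", "z", "label"])  # target included as a feature
clean = validation_accuracy(["x", "y", "z"])           # features only

print(f"With label as a feature: {leaky:.3f}")
print(f"Features only:           {clean:.3f}")
```

On the real data, the equivalent change would be to select only `["x", "y", "z"]` for `X` and rerun every model.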


## My method
Ideally, we would test many models to see which one performs best. Here I limit the comparison to three other models:

- Random Forest;
- Decision Trees;
- Linear Regression.

Here are the tests:

---

### Random Forest

In [18]:
from sklearn.ensemble import RandomForestClassifier

modelRF = RandomForestClassifier(n_estimators = 1000, min_samples_leaf = 10)
modelRF.fit(train_X, train_Y)

# Validate the model on the held-out set:
accuracyRF = modelRF.score(val_X, val_Y)
print(f"Accuracy in Random Forest: {accuracyRF * 100}%")

Accuracy in Random Forest: 100.0%


### Decision Trees

In [19]:
from sklearn.tree import DecisionTreeClassifier

modelDT = DecisionTreeClassifier()
modelDT.fit(train_X, train_Y)

# Validate the model on the held-out set:
accuracyDT = modelDT.score(val_X, val_Y)
print(f"Accuracy in Decision Trees: {accuracyDT * 100}%")

Accuracy in Decision Trees: 100.0%


### Linear Regression

In [20]:
from sklearn import linear_model

modelLinR = linear_model.LinearRegression()
modelLinR.fit(train_X, train_Y)

# Validate the model on the held-out set.
# Note: for LinearRegression, score() returns R^2, not classification accuracy.
accuracyLinR = modelLinR.score(val_X, val_Y)
print(f"Accuracy in Linear Regression: {accuracyLinR * 100}%")

Accuracy in Linear Regression: 100.0%
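
For a regressor such as `LinearRegression`, `score()` returns the coefficient of determination (R^2), so the figure above is not directly comparable with the classifiers' accuracy. A minimal sketch of the difference, using synthetic data (an assumption) and an illustrative 0.5 threshold to turn the continuous output into classes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score

# Synthetic data (an assumption): the label depends on the sign of x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-500, 500, size=(1000, 3))
y = (X[:, 0] + rng.normal(0, 100, size=1000) > 0).astype(float)

model = LinearRegression().fit(X, y)
raw_output = model.predict(X)

# score() on a regressor is the coefficient of determination (R^2).
r2_from_score = model.score(X, y)
r2_explicit = r2_score(y, raw_output)

# To get an accuracy, the continuous output must first be turned into classes.
# The 0.5 threshold is an illustrative choice.
predicted_class = (raw_output >= 0.5).astype(float)
accuracy = accuracy_score(y, predicted_class)

print(f"R^2 from score(): {r2_from_score:.3f}")
print(f"Accuracy at 0.5:  {accuracy:.3f}")
```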


---

## And what is the Best Model?
Here is a visual comparison of the scores each model obtained on this problem.

In [21]:
import seaborn as sns

sns.set(style="white", context="talk")

# Collect each model's validation score in a DataFrame. sns.load_dataset() only
# downloads seaborn's named example datasets, so it cannot take local data.
# Note: the Linear Regression value is R^2, not classification accuracy.
data = pd.DataFrame({
    "Models": ["Linear Regression", "Decision Trees", "Random Forest", "Logistic Regression"],
    "Accuracy": [accuracyLinR, accuracyDT, accuracyRF, accuracyLR],
})

sns.barplot(x="Models", y="Accuracy", data=data)
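
A single train/validation split can make models look identical. One common way to compare classifiers on equal footing is k-fold cross-validation. The sketch below is only an illustration: it runs on synthetic data (an assumption), uses the features without the label, and leaves out Linear Regression because it is not a classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption): class 1 when the point satisfies x + y + z > 0.
rng = np.random.default_rng(42)
X = rng.uniform(-500, 500, size=(600, 3))
y = (X.sum(axis=1) > 0).astype(int)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Trees": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 folds for each classifier, on features only.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```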



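**Conclusion:** All four models report a score of 100% on the validation set, so this comparison does not single out a best model. Because `label` is included in the feature columns, these scores most likely reflect target leakage rather than real predictive performance.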