# Your first predictions - Predicting Salaries 📈

--------------

## But first - a tutorial on `Jupyter Notebook` and `Python` basics 🚴‍♀️

### Jupyter Notebook 📝

A notebook consists of two main parts.

1. Text instructions like this one - these are made using a text formatting language called [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)

2. Code cells like the one below:

In [1]:
1 + 1 * 2

3

1. To run a code cell, click into it with your mouse and press the `► Run` button in the navbar at the top of the notebook. 
2. You can also use the shortcut `Shift + Enter` to run a cell!
3. A cell that has been run will get an `In [number]` next to it
4. The output (returned value) of a cell will be displayed below it with an `Out[number]` next to it
5. If you want to add another code cell - look for the `➕` button in the navbar.

In [2]:
# you will have cells like these for you to code in

--------

### 🐍Python basics

[**Python**](https://docs.python.org/) has been around since the late 1980s. In fact, the concept of Machine Learning has been around since the 1950s! 😯

But rapid advances in internet speed and data storage, together with a very active Python community, have brought the two together very well in the last 5 years.

In **Python** we have **built-in data types** to help us work with different kinds of data:

**Strings** (`str` in Python) for **literal text, column or file names**. Made by putting quotes (`""` or `''`) around the text.

In [3]:
"hello!"
"ML like a pro"

'ML like a pro'

**Integers** (`int` in Python) for **whole numbers**

In [4]:
42-10

32

**Floats** (`float` in Python) for **numbers with decimal points**. The decimal separator is always `.`

In [5]:
3.14

3.14

These **numeric** types accept all standard math calculations:

In [6]:
9 / 3
2 + 5
2 > 1

True
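Notice that the cell above evaluates three expressions but only shows `True`: a notebook displays the value of the last expression in a cell. Here is a small sketch of how to see all three results with `print()` (the variable names are only for illustration):

```python
# A notebook only displays the value of the last expression in a cell.
# Storing each result and printing them shows all three.
division = 9 / 3    # "/" always returns a float, even for whole results
addition = 2 + 5    # int + int stays an int
comparison = 2 > 1  # comparisons return a bool (True or False)

print(division, addition, comparison)  # prints: 3.0 7 True
```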

📦 We have **variables** to help store data:

In [7]:
name = "Alan Turing"
age = 42
new_employee_data = [0, 30, 3, 7.1, 12]

...and **re-use** it later:

In [8]:
"Hi, my name is " + name

'Hi, my name is Alan Turing'

In [9]:
# getting one year older :(
age = age + 1
age

43

💥And we have **methods** to perform actions on data:

In [10]:
name.upper()

'ALAN TURING'

In [11]:
number_of_n = name.count('n') # creating a new variable as a result of the method call
number_of_n

2
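If you are ever unsure which type a value has, the built-in `type()` function tells you. The sketch below reuses the values from the cells above; the list method `.append()` is an extra example that was not shown earlier.

```python
name = "Alan Turing"
age = 42
new_employee_data = [0, 30, 3, 7.1, 12]

# type() reports the data type of any value
print(type(name))               # <class 'str'>
print(type(age))                # <class 'int'>
print(type(3.14))               # <class 'float'>
print(type(new_employee_data))  # <class 'list'>

# Methods belong to a type: strings have .upper(), lists have .append()
shouted = name.upper()
new_employee_data.append(99)    # adds one item to the end of the list

print(shouted)                  # ALAN TURING
print(len(new_employee_data))   # 6
```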

### 1. Your turn! 🚀
Practice using some of the basic types we just covered. Here are some ideas:

* Create two strings and add them together with a `+` sign
* Create a variable with your age in years, then count your age in hours (roughly)
* Check if your birth month number is higher than (`>`) your birth day number
* Create a variable with your full name, then tell yourself that you rock in all caps! 💪 (e.g. `"YOU ROCK ALAN TURING!"`)

In [12]:
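# your code here
name = 'gui'
last_name = 'fontana'

age = 41
age_hours = age * 365 * 24  # rough age in hours, ignoring leap years
age_hours

birth_day = 6
birth_month = 6

birth_month > birth_day  # is the birth month number higher than the birth day number?

full_name = f'{name} {last_name}'
f'YOU ROCK {full_name}!'.upper()  # .upper() turns the whole message into caps

'YOU ROCK GUI FONTANA!'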

Don't worry if some things feel unnatural at first - you are learning a new language in just 20 minutes! 💪

--------------

# Let's get back into Data Science 🤖

1. Here we import the Python libraries that we will use to develop our model.

In [13]:
import pandas as pd
import numpy as np
import seaborn as sns

2. Run this cell to read the CSV file into a pandas DataFrame, which is the format we use for data analysis in Python

*Note: the dataset has already been cleaned and processed for learning purposes.*

In [14]:
salaries = pd.read_csv('clean_data/salaries.csv')
salaries

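| | Gender | Age | Department | Department_code | Years_exp | Tenure (months) | Gross |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 25 | Tech | 7 | 7.5 | 7 | 74922 |
| 1 | 1 | 26 | Operations | 3 | 8.0 | 6 | 44375 |
| 2 | 0 | 24 | Operations | 3 | 7.0 | 8 | 82263 |
| 3 | 0 | 26 | Operations | 3 | 8.0 | 6 | 44375 |
| 4 | 0 | 29 | Engineering | 0 | 9.5 | 25 | 235405 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1797 | 0 | 29 | Other | 4 | 9.5 | 34 | 88934 |
| 1798 | 0 | 27 | Engineering | 0 | 8.5 | 33 | 133224 |
| 1799 | 0 | 29 | Operations | 3 | 9.5 | 15 | 72547 |
| 1800 | 0 | 47 | Other | 4 | 18.5 | 30 | 227176 |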


--------------

## We can get a lot of insight without ML! 🤔

### 2. Your turn! 🚀

Let's start by understanding the data we have: how big the dataset is, what information (columns) it contains, and so on.

**💡 Tip:** remember to check the slides for the right methods ;)

In [36]:
# your code here

<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
salaries.shape # to see how many rows and columns there are
salaries.dtypes # to see the available columns and their data types
round(salaries.describe()) # to see a readable summary of the dataset, such as means, minimums and maximums
</pre>
</details>
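To see what these calls return without loading the CSV, here is a minimal sketch on a tiny hand-built DataFrame. The five rows are copied from the top of the table above; the name `sample` is only for illustration.

```python
import pandas as pd

# First five rows of the salaries table, typed in by hand
sample = pd.DataFrame({
    "Gender": [0, 1, 0, 0, 0],
    "Age": [25, 26, 24, 26, 29],
    "Department": ["Tech", "Operations", "Operations", "Operations", "Engineering"],
    "Department_code": [7, 3, 3, 3, 0],
    "Years_exp": [7.5, 8.0, 7.0, 8.0, 9.5],
    "Tenure (months)": [7, 6, 8, 6, 25],
    "Gross": [74922, 44375, 82263, 44375, 235405],
})

print(sample.shape)              # (rows, columns) -> (5, 7)
print(sample.dtypes)             # one data type per column
print(round(sample.describe()))  # summary statistics for the numeric columns
```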

Now try selecting only some of the columns - say we want to see just the departments, or the departments and the salaries:

In [37]:
# your code here

<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
salaries["Department"] # to see one column
salaries[["Department", "Gross"]] # double brackets if we want to see several columns
# A Series holds a single indexed list of values, while a DataFrame can be made up of
# more than one Series - we can say that a DataFrame is a collection of Series
# that can be used to analyze the data.
</pre>
</details>
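The difference between single and double brackets is easy to check for yourself. A minimal sketch, assuming a three-row stand-in for the dataset (the name `sample` is only for illustration):

```python
import pandas as pd

sample = pd.DataFrame({
    "Department": ["Tech", "Operations", "Operations"],
    "Years_exp": [7.5, 8.0, 7.0],
    "Gross": [74922, 44375, 82263],
})

one_column = sample["Department"]              # single brackets -> Series
two_columns = sample[["Department", "Gross"]]  # double brackets -> DataFrame

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
print(two_columns.shape)           # (3, 2)
```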

-------

### 3. Your turn - Now let's do some **visualization** 📊. 


Let's follow our intuition - do years of experience affect gross salary❓

Let's use a [Seaborn Scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) - a function in the Seaborn library (which we imported above and shortened to `sns`) that draws each data point as a dot at its `x` and `y` values.

In [38]:
# your code here

<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
sns.scatterplot(data=salaries, x="Years_exp", y="Gross")
</pre>
</details>

Recalling one of the questions from the slides - are women and men paid equally in this example❓

*Note: 'Male' is coded as 0, 'Female' - as 1*

In [39]:
# your code here

<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
sns.scatterplot(data=salaries, x="Years_exp", y="Gross", hue="Gender")
</pre>
</details>
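A plot is usually the quickest way to spot a pattern, but the same two questions can also be checked with numbers. A minimal sketch, assuming the first five rows of the table stand in for the full dataset, so the values are illustrative only:

```python
import pandas as pd

sample = pd.DataFrame({
    "Gender": [0, 1, 0, 0, 0],
    "Years_exp": [7.5, 8.0, 7.0, 8.0, 9.5],
    "Gross": [74922, 44375, 82263, 44375, 235405],
})

# Pearson correlation: +1 is a perfect upward line, 0 means no linear relationship
correlation = sample["Years_exp"].corr(sample["Gross"])
print(correlation)

# Average gross salary per gender code (0 = male, 1 = female)
mean_by_gender = sample.groupby("Gender")["Gross"].mean()
print(mean_by_gender)
```

On the real dataset you would replace `sample` with `salaries`.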

Let's also get a sense of some of the counts in our data - how many women and men are there? How many people in each department? Seaborn's countplot is here to help with that.

**💡 Tip:** you can always check the `.dtypes` or `.columns` attributes of your dataset to see what columns you have.

In [40]:
# your code here

<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
sns.set(rc = {'figure.figsize':(10,6)}) # make the figures larger - run this before plotting
sns.countplot(data=salaries, x="Gender") # to see how many of each gender we have in the dataset
sns.countplot(data=salaries, x="Department") # to see how many people we have in each department (run it in its own cell)
</pre>
</details>

**Bonus question:** can you visualize **how many men and women there are per department**? 🤔 A `hue` might help...

In [41]:
# your code here


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
sns.countplot(data=salaries, x="Department", hue="Gender")
</pre>
</details>
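The bars in a countplot are just counts, so you can also read them as numbers. A minimal sketch using pandas' `value_counts` and `crosstab`, assuming the first five rows of the table stand in for the full dataset:

```python
import pandas as pd

sample = pd.DataFrame({
    "Gender": [0, 1, 0, 0, 0],
    "Department": ["Tech", "Operations", "Operations", "Operations", "Engineering"],
})

# The numbers behind countplot(x="Gender") and countplot(x="Department")
gender_counts = sample["Gender"].value_counts()
department_counts = sample["Department"].value_counts()

# The numbers behind countplot(x="Department", hue="Gender")
per_department = pd.crosstab(sample["Department"], sample["Gender"])

print(gender_counts)
print(department_counts)
print(per_department)
```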

--------------

#### 🥈*A good data expert knows all the most complex models.* 
### 🥇*A great data expert knows when results can be achieved without them.* 

--------------

## Your first model - Linear Regression 📈

**1.** First, let's create what will be our...
  * Features and target
  * Inputs and output
  * X and Y

In [42]:
# your code here


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
features = salaries.drop(["Gross", "Department"], axis="columns") # dropping the Department column because it's text
target = salaries["Gross"]
</pre>
</details>
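A quick way to confirm that the split did what you expect is to look at the column names and shapes. A minimal sketch on the first five rows of the table (the name `sample` is only for illustration):

```python
import pandas as pd

sample = pd.DataFrame({
    "Gender": [0, 1, 0, 0, 0],
    "Age": [25, 26, 24, 26, 29],
    "Department": ["Tech", "Operations", "Operations", "Operations", "Engineering"],
    "Department_code": [7, 3, 3, 3, 0],
    "Years_exp": [7.5, 8.0, 7.0, 8.0, 9.5],
    "Tenure (months)": [7, 6, 8, 6, 25],
    "Gross": [74922, 44375, 82263, 44375, 235405],
})

features = sample.drop(["Gross", "Department"], axis="columns")  # the inputs (X)
target = sample["Gross"]                                          # the output (y)

print(list(features.columns))        # the five numeric input columns
print(features.shape, target.shape)  # (5, 5) (5,)
```

Note that `.drop()` returns a new DataFrame, so the original table still contains all of its columns.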

Feel free to check what is in your `features` and `target` below:

In [43]:
# your code here

**2.** Time to **import** the Linear Regression model

Python libraries like [Scikit-learn](https://scikit-learn.org/0.21/modules/classes.html) make it super easy for people getting into Data Science and ML to experiment.

The code is already in the library, it's just about **calling the right methods!** 🛠

In [44]:
# your code here


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
from sklearn.linear_model import LinearRegression
</pre>
</details>

Now to **initialize** the model

In [45]:
# your code here


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
model = LinearRegression()
</pre>
</details>

**3.** We **train** the model. 

This is the process in which the Linear Regression model looks for the line that best fits all the points in the dataset. This is the part where the computer is working hard to learn!! 🤖

In [46]:
# your code here


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
model.fit(features, target)
</pre>
</details>
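To make "finding the best-fitting line" concrete, here is a minimal sketch on made-up data that follows the rule `y = 2x + 1` exactly. The numbers and the names `x`, `y` and `toy_model` are invented for illustration and have nothing to do with the salaries dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: every point lies exactly on the line y = 2x + 1
x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])  # one feature, so one column
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

toy_model = LinearRegression()
toy_model.fit(x, y)

# After fitting, the model stores the slope and intercept it learned
print(toy_model.coef_)       # close to [2.]
print(toy_model.intercept_)  # close to 1.0
print(toy_model.predict(np.array([[10.0]])))  # close to [21.]
```

With several features, as in the salaries data, the model learns one coefficient per column instead of a single slope.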

**4.** We **score** the model

Models can have different default scoring metrics. Linear Regression uses something called `R-squared` by default - a metric that shows how much of the change in the target (gross salary) can be explained by changes in the features (Age, Department, Gender etc.)<br>
What is R-squared? It is the percentage of the variation in the response variable that is explained by a linear model. Or:<br>
R-squared = Explained variation / Total variation
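The ratio can be computed by hand to see what the score means. Below is a minimal sketch with made-up numbers, using NumPy's `polyfit` to fit the line instead of scikit-learn (a choice made only to keep the example short). For a least-squares line with an intercept, "explained / total" gives the same value as "1 - unexplained / total", which is the form the scikit-learn documentation uses for `.score`.

```python
import numpy as np

# Made-up data: roughly a straight line with some noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.5, 5.5, 8.5, 9.5])

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares straight line
predicted = slope * x + intercept

total = np.sum((y - y.mean()) ** 2)              # total variation in y
explained = np.sum((predicted - y.mean()) ** 2)  # variation captured by the line
unexplained = np.sum((y - predicted) ** 2)       # what the line misses

r_squared = explained / total
print(r_squared)
print(1 - unexplained / total)  # same value, up to rounding
```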

In [47]:
# your code here


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
model.score(features, target)
</pre>
</details>

⚠️ Be careful not to confuse this with accuracy. The number above says that "the inputs we have can help us explain around 40-45% of the variation in salary", which is decent considering we did this in 10 minutes!

**5.** Let's **predict** the salary of a new hire 🔮

*Note: here is a reminder of the columns in the table:* `['Gender', 'Age', 'Department_code', 'Years_exp', 'Tenure (months)']`

In [31]:
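features.head()

| | Gender | Age | Department_code | Years_exp | Tenure (months) |
|---|---|---|---|---|---|
| 0 | 0 | 25 | 7 | 7.5 | 7 |
| 1 | 1 | 26 | 3 | 8.0 | 6 |
| 2 | 0 | 24 | 3 | 7.0 | 8 |
| 3 | 0 | 26 | 3 | 8.0 | 6 |
| 4 | 0 | 29 | 0 | 9.5 | 25 |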


In [48]:
# here's a freebie! You can change the numbers below to change the info of your hire ;)
# Gender	Age	Department_code	Years_exp	Tenure (months)
# target Gross
hire = [[1, 41, 0, 5.2, 10]]

# your code here


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
model.predict(hire)
</pre>
</details>
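One detail worth knowing: when a model was trained on a DataFrame, recent scikit-learn versions show a warning if you predict from a plain list, because the column names are missing. Here is a minimal sketch of passing the hire as a one-row DataFrame instead. The five training rows come from the table above and are far too few for a meaningful prediction, so only the shape of the calls matters here; `columns` and `hire_df` are illustrative names.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

columns = ["Gender", "Age", "Department_code", "Years_exp", "Tenure (months)"]

# Five rows from the table above, only to have something to fit on
features = pd.DataFrame(
    [[0, 25, 7, 7.5, 7], [1, 26, 3, 8.0, 6], [0, 24, 3, 7.0, 8],
     [0, 26, 3, 8.0, 6], [0, 29, 0, 9.5, 25]],
    columns=columns,
)
target = pd.Series([74922, 44375, 82263, 44375, 235405], name="Gross")

model = LinearRegression()
model.fit(features, target)

# Same values as the hire list, but labeled with the training column names
hire_df = pd.DataFrame([[1, 41, 0, 5.2, 10]], columns=columns)
prediction = model.predict(hire_df)

print(prediction.shape)  # (1,): one prediction for one hire
```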

💡 A hint for **departments and their codes**:

* Engineering - 0
* Finance - 1
* Media - 2
* Operations - 3
* Other - 4
* Product - 5
* Sales - 6
* Tech - 7

--------------

# Congratulations, you are a Linear Regression wizard! 🧙‍♀️🧙‍♂️

- You can try playing around with the `hire` variable to see how the `.predict` results change
- You can also try changing the `features` variable - try removing more columns!
- Looking for a bigger challenge? 🏋️‍♀️ Head to the optional challenge `2. KNN - Customer Churn` to explore another type of model