# Your first predictions - Predicting Salaries 📈

--------------

## But first - a tutorial on `Jupyter Notebook` and `Python` basics 🚴‍♀️

### Jupyter Notebook 📝

A notebook consists of two main parts.

1. Text instructions like this one - these are made using a text formatting language called [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)

2. Code cells like the one below:

In [6]:
1 + 1 * 3

1. To run a code cell, click into it with your mouse and press the `► Run` button in the navbar at the top of the notebook. 
2. You can also use the shortcut `Shift + Enter` to run a cell!
3. A cell that has been run will get an `In [number]` next to it
4. The output (returned value) of a cell will be displayed below it with an `Out[number]` next to it
5. If you want to add another code cell - look for the `➕` button in the navbar.

In [None]:
# you will have cells like these for you to code in

--------

### 🐍Python basics

[**Python**](https://docs.python.org/) has been around since the late 1980s. In fact, the concept of Machine Learning has been around since the 1950s! 😯

But rapid advances in internet speed and data storage, together with a very active Python community, have brought the two together very well in the last 5 years.

In **Python** we have **built-in data types** to help us work with different kinds of data:

**Strings** (`str` in Python) for **literal text, column or file names**. Made by putting quotes (`""`) around the text.

In [10]:
print("hello!")
"1 + 1"

**Integers** (`int` in Python) for **whole numbers**

In [12]:
42
-10
0

**Floats** (`float` in Python) for **numbers with decimal points**. The decimal separator is always `.`

In [11]:
3.14
4.20

These **numeric** types accept all standard math calculations and comparisons:

In [14]:
10 > 50
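
As a quick illustration beyond the comparison above, here is a small sketch of the most common arithmetic operators. The variable names and numbers are made up for this example:

```python
# Made-up numbers, purely for illustration
monthly_salary = 2500
bonus = 1200.50

annual_salary = monthly_salary * 12   # multiplication -> 30000
total_pay = annual_salary + bonus     # int + float gives a float -> 31200.5
half = 7 / 2                          # division always returns a float -> 3.5
whole = 7 // 2                        # floor division -> 3
remainder = 7 % 2                     # modulo -> 1
squared = 3 ** 2                      # exponent -> 9

print(annual_salary, total_pay, half, whole, remainder, squared)
```

Note that mixing an `int` with a `float` gives a `float` back.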

📦 We have **variables** to help store data:

In [15]:
name = "Pavel Liser"
age = 42
new_employee_data = [0, 30, 3, 7.1, 12]

...and **re-use** it later:

In [16]:
"Hi, my name is " + name

In [17]:
# getting one year older :(
age = age + 1
age

💥And we have **methods** to perform actions on data:

In [18]:
name

In [21]:
name.upper()
name.count('e') # counting how many times the letter "e" appears in the name

### 1. Your turn! 🚀
Practice using some of the basic types we just covered. Here are some ideas:

* Create two strings and add them together with a `+` sign
* Create a variable with your age in years, then count your age in hours (roughly)
* Check if your birth month number is higher than (`>`) your birth day number
* Create a variable with your full name, then tell yourself that you rock in all caps! 💪 (e.g. `"YOU ROCK ALAN TURING!"`)

In [None]:
# your code here

Don't worry if some things feel unnatural at first - you are learning a new language in just 20 minutes! 💪

--------------

# Let's get back into Data Science 🤖

1. Run the cell below to `import` some Python libraries - these will be our tools for working with data 📊

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns

2. Run the cell below to read the `CSV` file into a `DataFrame` - a format that is great for data analysis inside Python!

*Note: the dataset has been cleaned and federated for learning purposes*

In [3]:
salaries = pd.read_csv('https://wagon-public-datasets.s3.amazonaws.com/sprints/ml-salaries.csv')
salaries

--------------

## We can get a lot of insight without ML! 🤔

### 2. Your turn! 🚀

Let's start by **understanding the data we have** - how big is the dataset, what is the information (columns) we have and so on:

**💡 Tip:** remember to check the slides for the right methods ;)

In [None]:
# your code here

<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
salaries.shape # to see how many rows, columns
salaries.dtypes # to see available columns and their data type
round(salaries.describe()) # to see a readable summary about the dataset, like averages, minimums and maximums
</pre>
</details>

Now try to **separate only some columns** - say we only want to see departments, or departments and salaries:

In [None]:
# your code here

<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
salaries["Department"] # to see one column
salaries[["Department", "Salary"]] # double bracket if we want to see multiple columns
</pre>
</details>

-------

#### 🥈*A good data expert knows all the most complex models.* 
### 🥇*A great data expert knows when results can be achieved without them.* 

--------------

## Your first model - Linear Regression 📈

**1.** First, let's create what will be our...
  * Features and target
  * Inputs and output
  * X and Y

In [24]:
features = salaries.drop(['Salary', 'Department'], axis='columns')
target = salaries['Salary']


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
features = salaries.drop(["Salary", "Department"], axis="columns") # dropping Salary (it's our target) and Department (it's text)
target = salaries["Salary"]
</pre>
</details>

Feel free to check what is in your `features` and `target` below:

In [26]:
features

**2.** Time to **import** the Linear Regression model

Python libraries like [Scikit-learn](https://scikit-learn.org/0.21/modules/classes.html) make it super easy for people getting into Data Science and ML to experiment.

The code is already in the library, it's just about **calling the right methods!** 🛠

In [27]:
from sklearn.linear_model import LinearRegression


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
from sklearn.linear_model import LinearRegression
</pre>
</details>

Now to **initialize** the model

In [28]:
model = LinearRegression()


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
model = LinearRegression()
</pre>
</details>

**3.** We **train** the model. 

This is the process where the Linear Regression model looks for a line that best fits all the points in the dataset. This is the part where the computer is hard at work **learning**! 🤖

In [29]:
model.fit(features, target)


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
model.fit(features, target)
</pre>
</details>
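
To make "looking for the line that best fits" more concrete, here is a minimal sketch of ordinary least squares for a single feature, written in plain Python. The data is invented for this example, and this is only an illustration of the idea, not how scikit-learn implements `fit` internally:

```python
# Toy data invented for this sketch: years of experience vs. salary
years_exp = [0, 1, 2, 3, 4]
salary = [2000, 2150, 2150, 2350, 2350]

mean_x = sum(years_exp) / len(years_exp)
mean_y = sum(salary) / len(salary)

# How x and y move together, and how much x varies on its own
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years_exp, salary))
sxx = sum((x - mean_x) ** 2 for x in years_exp)

slope = sxy / sxx                    # salary change per extra year of experience
intercept = mean_y - slope * mean_x  # salary the line gives at 0 years

print(slope, intercept)
```

The resulting line is the one with the smallest total squared distance to the points. With several features, the same idea applies, just with one slope per feature.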

**4.** We **score** the model

Models can have different default scoring metrics. Linear Regression uses `R-squared` by default - a metric that shows how much of the variation in the target (Gross salary) can be explained by the features (Age, Tenure, Gender etc.).

In [30]:
model.score(features, target)


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
model.score(features, target)
</pre>
</details>

⚠️ **Careful not to confuse this with accuracy**. The number above shows that **"the inputs we have explain around 40-45% of the variation in salary"**, which is decent considering we did this in 10 minutes!
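
If you are curious where such a number comes from, here is a small sketch of the `R-squared` formula in plain Python. The salaries and predictions are made up for illustration; they are not taken from our dataset:

```python
# Made-up salaries and predictions, purely for illustration
actual = [3000, 3500, 4000, 4500, 5000]
predicted = [3100, 3400, 4100, 4400, 5000]

mean_actual = sum(actual) / len(actual)

# Variation the model failed to explain (squared prediction errors)
ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
# Total variation of the target around its mean
ss_total = sum((a - mean_actual) ** 2 for a in actual)

r_squared = 1 - ss_residual / ss_total
print(round(r_squared, 3))
```

A value of `1` would mean the predictions match the real values perfectly, while a value near `0` means the model does no better than always guessing the average.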

**5.** Let's **predict** the salary of a new hire 🔮

*Note: here is a reminder of the columns in the table:* `['Gender', 'Age', 'Department_code', 'Years_exp', 'Tenure (months)']`

In [38]:
# here's a freebie! You can change the numbers below to change the info of your hire ;)
hire = [[0, 19, 7, 1, 10]]

# your code here
model.predict(hire)


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
model.predict(hire)
</pre>
</details>

💡 A hint for **departments and their codes**:

* Engineering - 0
* Finance - 1
* Media - 2
* Operations - 3
* Other - 4
* Product - 5
* Sales - 6
* Tech - 7
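
If you would rather not memorize the codes, one option is to keep them in a dictionary and look them up by name. This is just a convenience sketch: the `department_codes` name is made up here, while the codes and the column order come from the notes above:

```python
# Department names and their codes, as listed in the hint above
department_codes = {
    "Engineering": 0,
    "Finance": 1,
    "Media": 2,
    "Operations": 3,
    "Other": 4,
    "Product": 5,
    "Sales": 6,
    "Tech": 7,
}

# Column order: Gender, Age, Department_code, Years_exp, Tenure (months)
hire = [[0, 19, department_codes["Tech"], 1, 10]]
print(hire)
```

You could then pass this `hire` to `model.predict` exactly as before.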

--------------

**6.** **Explaining** the model

There is a whole field called [**Explainable AI (XAI)**](https://arxiv.org/abs/2006.00093) that is rising in popularity: the widespread application of machine learning, particularly deep learning, has produced highly accurate models, but these **models often lack explainability and interpretability**.

Luckily, Linear Regression is a [linear model](https://scikit-learn.org/stable/modules/linear_model.html), so its explainability is quite high.

**6.1.** We can check the `coef_` or the **coefficients** of the model. These explain how much the target (Gross salary) changes with a change of `1` in each of the features (inputs), while holding other features constant.

In [34]:
model.coef_


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
model.coef_
</pre>
</details>

🤔 We'd need to check the column order again, to know which number is which input. But, **we got you covered!** Run the cell below:

In [35]:
pd.concat([pd.DataFrame(features.columns),pd.DataFrame(np.transpose(model.coef_))], axis = 1)

**6.2** The other thing we can check is the **intercept** of the model. This is the predicted target (Gross salary) when all inputs are 0. So imagine a newborn baby going to the office:

In [36]:
model.intercept_


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
model.intercept_
</pre>
</details>
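
Together, the coefficients and the intercept are all a Linear Regression needs to make a prediction: it multiplies each input by its coefficient, adds everything up, and adds the intercept. Here is a sketch in plain Python. The intercept and coefficients below are hypothetical numbers chosen for illustration, so your model's values will differ:

```python
# Hypothetical values for illustration; use your own model.intercept_ and model.coef_
intercept = 1500.0
coefficients = [-200.0, 40.0, 15.0, 120.0, 5.0]

# Column order: Gender, Age, Department_code, Years_exp, Tenure (months)
hire = [0, 19, 7, 1, 10]

# prediction = intercept + coef_1 * input_1 + coef_2 * input_2 + ...
prediction = intercept + sum(c * x for c, x in zip(coefficients, hire))
print(prediction)
```

Doing the same calculation with your model's real intercept and coefficients should give the same number as `model.predict`.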

# Congratulations, you are a Linear Regression wizard! 🧙‍♀️🧙‍♂️

* You can try to play around with the `hire` variable to see the `.predict`ion results
* You can also try to change the `features` variable - try removing more columns!
* Looking for a bigger challenge? 🏋️‍♀️ Go to the **optional challenge `2. KNN - Customer Churn`** to explore another model type