# Onboarding DS - Part 1

In this first notebook, we will cover some important Python content that you may use in your Data Science projects.
By now, you should have installed `Anaconda` or just `Jupyter`. Throughout this notebook, we will introduce some packages and show how to use them:
* [numpy](https://numpy.org/doc/stable/): working with matrices and arrays, and treating data
* [pandas](https://pandas.pydata.org/docs/index.html): working with and processing tables (as DataFrames)

However, before using them, let's start with something simpler.

## Using a Jupyter Notebook

In a Jupyter Notebook, each cell is run separately, which means you can run your cells in any order you wish.
The results you obtain by running a cell are kept in memory and can be used in the other cells, as in a regular Python program.
Whenever you want to run a cell, you can press the `run` button at the top of the screen or hit the `shift`+`enter` keys on your keyboard.

You can configure a cell to contain code (in Python or even R and Julia) or text (markdown), by selecting the desired option in the dropdown at the top. Markdown cells (like this one :) ) are helpful to structure your code in parts and explain your analysis.

## Variables

As in Maths, variables are used to store information. This information can be of many different types, such as:
* integer
* float
* string
* bool
* ...

Whenever you need to assign a value to a variable, you can use the `=` operator. <br>
String values may be written between `""` or `''`. <br>
Bool variables can have `True` or `False` values.

In [1]:
v = 45
t = True
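
As a small sketch of the types listed above, the cell below assigns one value of each kind and asks Python for its type with the built-in `type()`. The names `price`, `name` and `x` and their values are hypothetical; only `v` and `t` come from the cell above.

```python
v = 45            # integer
t = True          # bool
price = 9.99      # float (hypothetical value)
name = "Ada"      # string; 'Ada' with single quotes is equivalent

# type() reports the type of the value a variable currently holds.
print(type(v).__name__)
print(type(t).__name__)
print(type(price).__name__)
print(type(name).__name__)

# A variable can be reassigned, even to a value of a different type.
x = 10
x = "ten"
print(type(x).__name__)
```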

As a data scientist, you will have to deal with large amounts of data. To manipulate it, you can use the structures below:

Note: each data type has its own specificities and methods you can use to manipulate it. We invite you to read more about them in the documentation (https://www.python.org/doc/ , https://www.w3schools.com/python/default.asp)

### List
> You can store many values inside a single list. These values do not have to be of the same type.
>
> a = [1, 2, 3]
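
A minimal sketch of common list operations, using a hypothetical list called `mixed`. It shows indexing, replacing an element, and adding and removing elements.

```python
# Hypothetical values chosen for illustration.
mixed = [1, "two", 3.0]   # elements may have different types

print(mixed[0])      # first element (indexing starts at 0)
print(mixed[-1])     # last element
print(len(mixed))    # number of elements

mixed[1] = 2         # lists are mutable: replace an element in place
mixed.append("new")  # add an element at the end
mixed.remove(3.0)    # remove the first element equal to 3.0
print(mixed)
```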

### Tuple
> Like a list, a tuple can store many values, and the order of its elements is preserved. However, a tuple cannot be changed (no adding, removing or replacing values) once it is created.
>
> a = (1, 2)
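
The sketch below illustrates what "cannot be changed" means in practice, using a hypothetical tuple named `point`. Trying to assign to an element raises a `TypeError`, so the usual approach is to build a new tuple instead.

```python
point = (1, 2)   # hypothetical tuple
x, y = point     # unpacking: one variable per element
print(point[0])  # indexing works as with lists

try:
    point[0] = 10            # tuples do not support item assignment
except TypeError as err:
    print("Tuples cannot be modified:", err)

# To "change" a tuple, create a new one from the old values.
moved = (point[0] + 10, point[1])
print(moved)
```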

### Set
> You can also store many values inside a set. As in Maths, a set does not contain duplicate values and does not have an order. You cannot change the value of one of its elements, but you can add and remove elements.
>
> a = {1, 3, 5, 7}
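
As an illustration of these properties, the following sketch uses a hypothetical set named `odds`. It shows that duplicates are dropped, and how to add, remove and test elements and combine sets.

```python
odds = {1, 3, 5, 7, 3, 1}  # duplicates are dropped on creation
print(len(odds))           # counts distinct elements only

odds.add(9)                # add an element
odds.discard(1)            # remove an element (no error if it is absent)
print(3 in odds)           # membership test

evens = {2, 4, 6}          # hypothetical second set
union = odds | evens       # elements in either set
common = odds & {5, 7, 11} # elements in both sets
print(union)
print(common)
```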

### Dictionary
> A dictionary maps keys to values. It is composed of key-value pairs.
>
> a = {"Neymar": "PSG", "CR7":"Manchester", "Messi": "PSG"}
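
A short sketch of typical dictionary operations, reusing the example above under the name `clubs`. The entries `"Player X"`, `"Club Y"` and `"Club Z"` are hypothetical placeholders.

```python
clubs = {"Neymar": "PSG", "CR7": "Manchester", "Messi": "PSG"}

print(clubs["Messi"])                # look up a value by its key
print(clubs.get("Pele", "unknown"))  # .get() returns a default if the key is missing

clubs["Player X"] = "Club Y"  # add a new key-value pair (hypothetical)
clubs["CR7"] = "Club Z"       # overwrite the value of an existing key (hypothetical)
del clubs["Neymar"]           # remove a pair by its key

# Iterate over all key-value pairs.
for player, club in clubs.items():
    print(player, "->", club)
```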

<div class = 'alert alert-block alert-info'> Task 1: Create a new cell and a list called <b>list1</b>, with string values of your choice. Then, <b>append</b> the variable v created earlier to your list1.
    
Create a set of tuples of length = 2 whose values are floating-point numbers and call it <b>coordinates</b>. Your set must have at least 5 tuples. Then, create a variable <b>u</b> whose value is the sum of the elements of the third tuple from "coordinates".

## NumPy and Pandas

As mentioned before, these packages are **really** helpful when we are dealing with table-format data and matrices.
To install them, you should open your terminal and type:

> \>\> `pip3 install numpy` <br>
> \>\> `pip3 install pandas`

Or, if you are using Anaconda:

> \>\> `conda install numpy` <br>
> \>\> `conda install pandas`

After installing them, you need to import these packages into your code. By convention, we import the packages we use at the beginning of the code. To simplify referencing pandas and numpy when using their functions, we use `pd` as an alias for pandas and `np` for numpy.

In [2]:
import pandas as pd
import numpy as np

### NumPy

Data can be manipulated with this package if it is in NumPy array format.<br>
There are many things you can do easily with NumPy and we will show some of them here.

In [5]:
a = np.zeros((4,))
b = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

print(a)
print(b, "\n")
print(b.T, "\n") # transpose
print(np.mean(b, axis=0), "\n") # mean of each column

[0. 0. 0. 0.]
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
[[ 1  5  9]
 [ 2  6 10]
 [ 3  7 11]
 [ 4  8 12]]
[5. 6. 7. 8.]
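
To make the `axis` argument used above more concrete, here is a small sketch that rebuilds the same array `b` and applies a few other standard NumPy operations. With `axis=0` the operation runs down the rows and gives one result per column. With `axis=1` it gives one result per row. The variable names other than `b` are hypothetical.

```python
import numpy as np

b = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

print(b.shape)         # (rows, columns)
print(b[0, 1])         # element at row 0, column 1
first_col = b[:, 0]    # all rows, first column
print(first_col)

col_max = b.max(axis=0)  # maximum of each column
row_max = b.max(axis=1)  # maximum of each row
print(col_max)
print(row_max)

doubled = b * 2          # arithmetic is applied element by element
print(doubled)
```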


<div class = 'alert alert-block alert-info'> Task 2: Create a numpy array with 1 row and 3 columns and call it <b>c</b>. Then concatenate it with the array <b>a</b>. Calculate the sum of each row and each column of this new array using a numpy method.

### Pandas

With Pandas, our data must be in DataFrames (or Series).

In [9]:
df = pd.DataFrame(
    {
        "Name": [
            "A",
            "B",
            "C",
            "D",
            "E"
        ],
        "Age": [22, 35, 58, 17, 42],
        "Sex": ["male", "male", "female", "female", "male"],
        "State": ["SP", "MG", "MG", "AC", "SP"],
    }
)

display(df)
display(df.shape)
display(df.sort_values(by="Age")) # sort the records by age
display(df.groupby("State").size()) # counting the number of people in each state

| | Name | Age | Sex | State |
|---|---|---|---|---|
| 0 | A | 22 | male | SP |
| 1 | B | 35 | male | MG |
| 2 | C | 58 | female | MG |
| 3 | D | 17 | female | AC |
| 4 | E | 42 | male | SP |


(5, 4)

| | Name | Age | Sex | State |
|---|---|---|---|---|
| 3 | D | 17 | female | AC |
| 0 | A | 22 | male | SP |
| 1 | B | 35 | male | MG |
| 4 | E | 42 | male | SP |
| 2 | C | 58 | female | MG |


State
AC    1
MG    2
SP    2
dtype: int64
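
Besides sorting and grouping, two very common operations are selecting columns and filtering rows. The sketch below rebuilds the same `df` so that it runs on its own. The names `ages`, `adults` and `sp_people`, and the age threshold of 18, are hypothetical choices for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D", "E"],
        "Age": [22, 35, 58, 17, 42],
        "Sex": ["male", "male", "female", "female", "male"],
        "State": ["SP", "MG", "MG", "AC", "SP"],
    }
)

ages = df["Age"]              # selecting one column returns a Series
print(ages.max())             # Series have their own methods

adults = df[df["Age"] >= 18]  # keep only the rows where the condition is True
print(adults)

# .loc selects rows by a condition and columns by name in one step.
sp_people = df.loc[df["State"] == "SP", ["Name", "Age"]]
print(sp_people)
```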

<div class = 'alert alert-block alert-info'> Task 3: Convert <b>b</b> into a dataframe. Then concatenate it with the array <b>a</b>.
    
Now, find out the mean age of each sex. (Tip: read about the operations and aggregation functions you can do with `groupby`)