# First look at Python

Python is a programming language that is extensively used in the tech industry. 

It lets you do a vast number of things: build websites, automate tasks on your computer, do machine learning and data analysis, run scientific simulations, support scientific research... It's an extremely useful tool!

In [None]:
print("Hello! This is a very simple statement in Python. It prints text.")

In [None]:
from datetime import date 
print(f"Today's date is {date.today()}")


In [None]:
print(f"Python can also do mathematics, for example: 2+500={2+500}")

We can also do more complicated things (for example, more complicated mathematics) by importing **libraries**: collections of code that other people have written with a particular application in mind, and that we can reuse.

Here, we import the `math` library, which allows us to do more complex mathematics.

In [None]:
import math  # we tell Python we want to use this library
print(f"Also somewhat complicated mathematics, sqrt(72)~= {math.sqrt(72):.2f}")

print(f"Compare sin(pi/4 = 45°) = {math.sin(math.pi / 4)} with sqrt(2)/2 = {math.sqrt(2)/2}")

For data science in particular, one library is very widely used.

It's called **pandas**, and it can load data files as pandas **DataFrames**.

To try it out, we're going to load the Titanic dataset from a [CSV file that's available online](https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv).

A `csv` file is a bit like an Excel file, albeit much simpler: the acronym stands for **c**omma-**s**eparated **v**alues. Go have a look!
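
To make the format concrete, here is a minimal sketch that parses a few lines of CSV text with Python's built-in `csv` module. The rows are made up for illustration (they are not taken from the Titanic file), and `sample_text` is just a name chosen here.

```python
import csv
import io

# Hypothetical CSV content: the first line holds the column names,
# every following line is one row, and commas separate the values.
sample_text = """name,age,city
Alice,34,Paris
Bob,27,Lyon
"""

# DictReader uses the header line to label each value in a row.
rows = list(csv.DictReader(io.StringIO(sample_text)))

for row in rows:
    print(row["name"], row["age"], row["city"])
```

Note that every value is read as text here; pandas goes one step further and tries to guess a suitable type (number, text...) for each column.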

In [None]:
import pandas as pd 

# load the file from the internet
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')

# show the first 5 lines
df.head(5)

In [None]:
# we can do some easy computations: what is the average age by gender? how many people are there in each class?

df.groupby('sex')['age'].mean()
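
A related question is how many people there are in each class. One way to answer it is `value_counts()`. The sketch below uses a tiny made-up table, `toy`, instead of the real dataset so that it runs without an internet connection; on the real data the equivalent call would be `df['class'].value_counts()`.

```python
import pandas as pd

# Hypothetical mini-table reusing two column names from the Titanic data.
toy = pd.DataFrame({
    "class": ["First", "Third", "Third", "Second", "Third"],
    "sex": ["female", "male", "female", "male", "male"],
})

# Count how many rows (passengers) fall into each class.
class_counts = toy["class"].value_counts()
print(class_counts)
```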

In [None]:
# percentage of survivors per class and sex

df.groupby(['class', 'sex'])['survived'].mean()*100
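
Multiplying a mean by 100 gives a percentage here because `survived` only contains 0 (did not survive) and 1 (survived), so its mean is the fraction of survivors. A minimal sketch in plain Python, with made-up values that are not taken from the dataset:

```python
# Hypothetical survival flags for five passengers: 1 = survived, 0 = did not.
survived = [1, 0, 0, 1, 1]

# The mean of a 0/1 column is the share of ones.
survival_rate = sum(survived) / len(survived)

print(f"{survival_rate * 100:.0f}% survived")
```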

# Profiling

We can do some simple "profiling", i.e. exploration of the data to understand it.

In [None]:
# number of rows and columns, printed as (rows, columns)

df.shape

In [None]:
df.columns

In [None]:
df['age'].describe()

In [None]:
print("\nSurvived values are:", df['survived'].unique())
print("\nPclass values are:", df['pclass'].unique())
print("\nSex values are:", df['sex'].unique())
print("\nEmbark town values are:", df['embark_town'].unique())


For example, `nan` indicates a missing value! How many are there in the `embark_town` column?

In [None]:
df['embark_town'].isna().sum()

There are two missing values for the embark town.
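
The count works because `isna()` marks each value as `True` (missing) or `False` (present), and `sum()` counts the `True` values. The sketch below shows this on a small made-up column, `towns`, and also one possible way to deal with missing entries, `fillna`. Whether a placeholder is appropriate depends on the analysis, so treat it as an illustration only.

```python
import pandas as pd

# Hypothetical column with two missing entries.
# pandas treats None as missing, just like nan.
towns = pd.Series(["Southampton", None, "Cherbourg", None, "Queenstown"])

# isna() gives one boolean per value; summing counts the True ones.
n_missing = towns.isna().sum()
print(n_missing)

# One possible treatment: replace missing entries with a placeholder label.
filled = towns.fillna("Unknown")
print(filled.tolist())
```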

This can be relevant when you're working with data: you want to understand it first, before doing anything complicated like training a machine learning model!

The advantage of using Python over Pyramid or Excel is that Python is **much better suited** to handling complex or large datasets and to tasks like machine learning.

However, learning Python can involve a **steep learning curve**, particularly if you have no programming experience. Always remember to use the tool that suits the task at hand: if Pyramid or Excel satisfies your data needs, there is no need to use Python at all costs.

There are lots of libraries for doing data science in Python: you can work with data using `pandas`, which is what we will briefly see today; but bear in mind that there are alternatives for working with large datasets (several tens of GB), such as `dask` and `spark`.

In addition to these libraries that *handle* data, there are libraries that can do a lot of analysis, including machine learning. Examples include `scikit-learn`, `tensorflow`, `pytorch`...