# Week 0: The "Why?" and "How?" of Python, Pandas, NumPy

This week is an informal kick-off for the mini-intro to Python, Pandas, NumPy and visualisations in Python. The notebook is mostly a collection of useful resources to get you started, with a few very small motivating examples.

## Python
<!-- (![Python logo](https://www.python.org/static/community_logos/python-logo-master-v3-TM.png) This would work but it doesn't resize the image -->
<img src="https://www.python.org/static/community_logos/python-logo-master-v3-TM.png" alt="Python logo" width=200 style="float: right;"  />

If you're reading this, you don't need convincing that Python is a useful language to know. It consistently ranks as one of the most popular programming languages, is relatively easy to learn at the basic level, and has gained popularity in the Data Science and Machine Learning communities. It is used by a number of large organisations such as Google, CERN, Facebook, Amazon and Spotify. Its large standard library is recognised as one of Python's strengths.

Enough said. If you want to learn more, use Google, or have a look at the [Wikipedia entry for Python](https://en.wikipedia.org/wiki/Python_(programming_language)).

### Tutorial
There are lots of good tutorials online; the W3Schools and docs.python.org ones are two great choices:
- ***recommended*** https://www.w3schools.com/python/default.asp 
- ***recommended*** https://docs.python.org/3/tutorial/index.html


- https://wiki.python.org/moin/BeginnersGuide/Programmers - has a list of tutorials to suit all styles of learning
- or use Pluralsight - request a company-funded account if you don't already have one

### Working environment
**Notebooks** are a great way to get started and an alternative to plain **.py** files; it is useful to know how to work with both.

A few things to look into:
- installing **Anaconda**
- **Jupyter Notebooks** or **Jupyter Lab** as simple IDEs
- **PyCharm** or **Visual Studio Code** as alternative IDEs
- at home think about installing packages using **conda** and **pip**
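
Once an environment is set up, it can help to confirm which interpreter is running and whether the packages used in this notebook are visible to it. The snippet below is a minimal sketch using only the standard library; the `installed` helper and the `PACKAGES` list are illustrative names, not part of any tool mentioned above.

```python
import importlib.util
import sys

# Packages this notebook leans on; adjust the list to taste.
PACKAGES = ["numpy", "pandas", "matplotlib"]


def installed(name):
    """Return True if the top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


print("Interpreter:", sys.executable)
for name in PACKAGES:
    status = "found" if installed(name) else "missing (install with conda or pip)"
    print(f"{name:12s} {status}")
```

Anything reported as missing can then be installed with **conda** or **pip** into the same environment.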

### Varia
- start with datasets in files (CSV, JSON, etc), there's plenty of them on ***Kaggle***
    - or google to find lots of sources of great datasets, e.g. start here https://collaborativedataplatform.com/blog/basics/top-25-public-datasets/
    
- if you feel you really must, you can see how to run SQL queries from Python
    - https://www.datacamp.com/tutorial/tutorial-how-to-execute-sql-queries-in-r-and-python 
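
As a rough illustration of the points above, the sketch below reads CSV and JSON data and runs a SQL query using only the standard library. The weather data, the table name and the column names are all made up for the example, and the data is kept in memory so no files or database server are needed.

```python
import csv
import io
import json
import sqlite3

# Tiny made-up dataset, held in memory so the example needs no files.
csv_text = "city,temp_c\nOslo,4\nCairo,29\nLima,19\n"
json_text = '[{"city": "Oslo", "temp_c": 4}, {"city": "Cairo", "temp_c": 29}]'

# CSV: each row becomes a dict keyed by the header line (values are strings).
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: parsed straight into Python lists and dicts (numbers stay numbers).
json_rows = json.loads(json_text)

# SQL: load the CSV rows into an in-memory SQLite table and query it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (city TEXT, temp_c INTEGER)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [(row["city"], int(row["temp_c"])) for row in csv_rows],
)
warm = conn.execute(
    "SELECT city FROM weather WHERE temp_c > 10 ORDER BY city"
).fetchall()
conn.close()

print(csv_rows[0])
print(json_rows[1]["temp_c"])
print(warm)
```

With real files, `io.StringIO(csv_text)` would be replaced by an opened file, and Pandas offers higher-level readers for the same formats.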

In [65]:
from platform import python_version

print(python_version())

3.9.13


In [66]:
import sys
print("A:",sys.executable)
print("B:",sys.version)
print("C:",sys.version_info)

A: C:\Users\Ola\anaconda3\python.exe
B: 3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)]
C: sys.version_info(major=3, minor=9, micro=13, releaselevel='final', serial=0)


In [67]:
print("Hello world!")
for i in range(1,20,3):
    print(i," ",end="")
    
my_string = "Python is fun"
print("\n\nJust another example - let's reverse a string ('{}' -> '{}')".format(my_string,my_string[::-1]))

Hello world!
1  4  7  10  13  16  19  

Just another example - let's reverse a string ('Python is fun' -> 'nuf si nohtyP')


## NumPy
- package for scientific computing 
- wide variety of mathematical operations on arrays
- vectorisation - efficient calculations with arrays and matrices
- wide range of mathematical functions
- basis for pandas, matplotlib, etc
- optimised and pre-compiled C code
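
To make the vectorisation point concrete, here is a minimal sketch with made-up prices and quantities. It shows element-wise arithmetic, a reduction, a mathematical function applied to a whole array, and a boolean mask, all without an explicit Python loop.

```python
import numpy as np

# Made-up values, purely for illustration.
prices = np.array([2.39, 3.39, 3.39, 2.39])
quantities = np.array([1, 2, 1, 3])

# Element-wise arithmetic: no explicit Python loop needed.
line_totals = prices * quantities

# Reductions and mathematical functions also work on whole arrays.
total = line_totals.sum()
rounded_roots = np.round(np.sqrt(quantities), 2)

# Boolean masks select the elements that satisfy a condition.
expensive = prices[prices > 3]

print(line_totals)
print(total)
print(rounded_roots)
print(expensive)
```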

### API reference
- https://numpy.org/doc/stable/reference/index.html

### Tutorials
- https://numpy.org/doc/stable/user/absolute_beginners.html
- https://www.w3schools.com/python/numpy/default.asp
- lots of others depending on your style; Google is your friend

### Exercises
1. https://www.machinelearningplus.com/python/101-numpy-exercises-python/
1. https://www.w3schools.com/python/numpy/exercise.asp
1. https://www.w3resource.com/python-exercises/numpy/index.php

There are many different ways to compute the dot product of two vectors. The examples are a bit too artificial to show meaningful timings, but it's fun to see the comparison anyway.

Some nice illustrations of timing comparisons: https://medium.com/pythoneers/vectorization-in-python-an-alternative-to-python-loops-2728d6d7cd3e and https://www.programiz.com/python-programming/numpy/vectorization

In [68]:
import numpy as np

size=60000
a = b = [10 for _ in range(0,size)]

In [69]:
%%timeit

c = []
for i in range(len(a)):
    c.append(a[i]*b[i])

sum(c)
#assert sum(c) == 6000000

33.1 ms ± 7.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [70]:
%%timeit

d = np.array(a)*np.array(b)
d.sum()
#assert d.sum() == 6000000

27.3 ms ± 3.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [71]:
%%timeit

an=np.array(a)
bn=np.array(b) 
an.dot(bn)
#assert an.dot(bn) == 6000000

24.3 ms ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [72]:
%%timeit

# all list comprehensions below work, some will be more efficient than others 
#f = [a[i]*b[i] for i,_ in enumerate(a)]
#f = [a[i]*b[i] for i in range(size)]
f = [x*y for x,y in zip(a,b)]
sum(f)
#assert sum(f) == 6000000

10.6 ms ± 2.83 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [73]:
np.array(a).dot(np.array(b))

6000000
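
The NumPy cells above build the arrays with `np.array(a)` inside the timed code, so part of what they measure is the list-to-array conversion. The sketch below is one possible way to separate the two, using the standard `timeit` module instead of the `%%timeit` magic; the function names and the number of runs are arbitrary choices for illustration, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

size = 60000
a = b = [10] * size

# Convert once, outside the timed code, so only the arithmetic is measured.
an = np.array(a)
bn = np.array(b)


def loop_dot():
    """Dot product with a plain Python loop over the two lists."""
    total = 0
    for x, y in zip(a, b):
        total += x * y
    return total


def numpy_dot():
    """Dot product on the pre-built NumPy arrays."""
    return an.dot(bn)


# Both approaches must agree before their timings are worth comparing.
assert loop_dot() == numpy_dot() == 6000000

runs = 20  # arbitrary small number to keep the example quick
print("loop :", timeit.timeit(loop_dot, number=runs) / runs, "s per call")
print("numpy:", timeit.timeit(numpy_dot, number=runs) / runs, "s per call")
```

With the conversion moved out of the timed section, the NumPy version would normally be expected to come out well ahead of the loop.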

## Pandas
- Open source data analysis and manipulation tool
- Built on top of the Python programming language
- Organised around 2 key data structures - _DataFrame_ and _Series_
- Data can be imported from various file formats, e.g. comma-separated values, JSON, Parquet, SQL database tables or queries, Microsoft Excel, and more
- Some data operations often done with Pandas:
    - Data cleansing
    - Merges and joins, grouping
    - Data inspection
    - Dealing with missing data
    - Data visualization
    - Loading and saving data
    - Data normalization
    - Statistical analysis
- Robust I/O tools, time series-specific functionality, split-apply-combine operations, and so on
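
A few of the operations listed above can be sketched on a tiny, made-up table. The column names and values below are invented for illustration; the example touches on the DataFrame/Series relationship, missing data, split-apply-combine via `groupby`, and a merge.

```python
import pandas as pd

# Made-up data; None marks a missing value.
orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "item": ["Izze", "Chips", "Chicken Bowl", "Izze"],
    "quantity": [1, 2, None, 3],
})

# A single column of a DataFrame is a Series.
quantity = orders["quantity"]

# Dealing with missing data: fill the gap with a default of 1.
orders["quantity"] = quantity.fillna(1).astype(int)

# Split-apply-combine: total quantity per item.
per_item = orders.groupby("item")["quantity"].sum()

# Merge: attach a price to each row from a second (also made-up) table.
prices = pd.DataFrame({"item": ["Izze", "Chips"], "price": [3.39, 2.39]})
merged = orders.merge(prices, on="item", how="left")

print(per_item)
print(merged)
```

Because the merge is a left join, the row whose item has no entry in `prices` is kept and its price is left missing.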

### API reference
- https://pandas.pydata.org/pandas-docs/stable/reference/index.html

### Tutorials
- 10 minutes to Pandas: https://pandas.pydata.org/docs/user_guide/10min.html
- W3Schools: https://www.w3schools.com/python/pandas/default.asp


### Exercises
1. https://www.machinelearningplus.com/python/101-pandas-exercises-python/
1. https://www.w3schools.com/python/pandas/exercise.asp
1. https://github.com/guipsamora/pandas_exercises/tree/master


## Bonus - some others

### Visualisations
- Matplotlib, Seaborn
- Plotly, Bokeh, Dash - interactive visualisations
- geopandas, geoplot - geospatial visualisations
- usual diagrams; wordclouds


- https://towardsdatascience.com/introduction-to-data-visualization-in-python-89a54c97fbed
- https://www.datacamp.com/tutorial/wordcloud-python
- https://towardsdatascience.com/visualizing-geospatial-data-in-python-e070374fe621
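
For a first taste of plotting, the sketch below draws a simple bar chart with Matplotlib. The item names and counts are made up, and the chart is written to an in-memory buffer only so that the example runs without a display; in a notebook the figure would normally just appear under the cell.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Made-up counts, purely for illustration.
items = ["Izze", "Chips", "Chicken Bowl"]
counts = [4, 2, 1]

fig, ax = plt.subplots(figsize=(4, 3))
bars = ax.bar(items, counts)
ax.set_title("Items ordered")
ax.set_ylabel("quantity")
fig.tight_layout()

# Render the chart as PNG bytes instead of opening a window.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
print("PNG size in bytes:", buffer.getbuffer().nbytes)
```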

### Interesting ML/AI packages
- Scikit-Learn
- SciPy
- PyTorch
- Hugging Face
- TensorFlow
- Keras
- NLTK

### Markdown for notebooks

1. https://www.kaggle.com/code/cuecacuela/the-ultimate-markdown-cheat-sheet
1. https://www.ibm.com/docs/en/watson-studio-local/1.2.3?topic=notebooks-markdown-jupyter-cheatsheet


### Others
- type checking, e.g. mypy
- image processing, e.g. PIL (Pillow)
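
On the type checking point, here is a minimal sketch of what type hints look like. The `dot` function is a made-up example; the annotations are standard Python syntax, and a checker such as mypy reads them without running the code.

```python
from typing import List


def dot(xs: List[float], ys: List[float]) -> float:
    """Dot product of two equal-length lists of numbers."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    return sum(x * y for x, y in zip(xs, ys))


print(dot([1.0, 2.0], [3.0, 4.0]))

# Python itself ignores the annotations at run time, so a call such as
# dot("ab", "cd") only fails once the multiplication is attempted.
# A static checker such as mypy is meant to flag that call beforehand.
```

Saving this to a file (say a hypothetical `dot_example.py`) and running `mypy dot_example.py` is the usual way to check it.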


<div class="alert alert-block alert-info">
<b>A quick example of markdown fun</b> https://www.kaggle.com/code/cuecacuela/the-ultimate-markdown-cheat-sheet</div>

<div class="alert alert-block alert-success">
<b>Success:</b> This alert box indicates a successful or positive action.
</div>

<div class="alert alert-block alert-danger">
<b>Danger:</b> This alert box indicates a dangerous or potentially negative action.
</div>

In [74]:
import numpy as np
a = np.array([1,2,3])
print("Repeat {}".format(np.repeat(a,3)))
print("Tile {}".format(np.tile(a,3)))
np.concatenate([np.repeat(a,3),np.tile(a,3)])

Repeat [1 1 1 2 2 2 3 3 3]
Tile [1 2 3 1 2 3 1 2 3]


array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3])

In [75]:
# This will not work from the work network. You could download the TSV file locally, though, and just refer to its path.
import pandas as pd

chipo_df=pd.read_csv(r"https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv",sep="\t")
chipo_df

| | order_id | quantity | item_name | choice_description | item_price |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | | $2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | $3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | | $2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
| ... | ... | ... | ... | ... | ... |
| 4617 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Sour ... | $11.75 |
| 4618 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese... | $11.75 |
| 4619 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $11.25 |
| 4620 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Lettu... | $8.75 |
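
A natural next step with this data is to turn `item_price` into a number, since the leading `$` makes Pandas read it as text. The sketch below rebuilds the first five rows shown above by hand (so it runs without the download) and is only one possible way to do the cleaning; the `sample` and `per_order` names are made up for the example.

```python
import pandas as pd

# The first five rows from the output above, so no download is needed.
sample = pd.DataFrame({
    "order_id": [1, 1, 1, 1, 2],
    "quantity": [1, 1, 1, 1, 2],
    "item_name": [
        "Chips and Fresh Tomato Salsa",
        "Izze",
        "Nantucket Nectar",
        "Chips and Tomatillo-Green Chili Salsa",
        "Chicken Bowl",
    ],
    "item_price": ["$2.39", "$3.39", "$3.39", "$2.39", "$16.98"],
})

# item_price is text because of the leading "$"; strip it and convert.
sample["item_price"] = (
    sample["item_price"].str.replace("$", "", regex=False).astype(float)
)

# Total per order. ASSUMPTION: item_price is already the line total
# (price times quantity), so it is summed without multiplying by quantity.
per_order = sample.groupby("order_id")["item_price"].sum()
print(per_order)
```

The same two steps should carry over to the full `chipo_df` once it has been loaded.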
