# Week 1: Why Python, pandas, NumPy (and others)?

**Python**
- popular and easy(ish) to learn, with a gradual learning curve
- widely used in Data Science, Machine Learning, etc.
- its popularity has led to lots of efficient packages


**NumPy**
- package for scientific computing
- wide variety of mathematical operations on arrays
- vectorisation: efficient calculations with arrays and matrices
- wide range of mathematical functions
- basis for pandas, Matplotlib, etc.
- built on optimised, pre-compiled C code
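
To make the vectorisation point concrete, here is a minimal sketch. The temperature values and variable names are made up for illustration; the NumPy calls themselves (`np.array`, `np.sqrt`, `.mean()`) are standard.

```python
import numpy as np

# Hypothetical sample data: temperatures in Celsius.
celsius = np.array([0.0, 12.5, 25.0, 37.5, 100.0])

# Element-wise arithmetic on the whole array, with no explicit Python loop.
fahrenheit = celsius * 9 / 5 + 32

# Universal functions (ufuncs) such as np.sqrt also work element-wise.
roots = np.sqrt(celsius)

# Aggregations run in compiled code as well.
mean_f = fahrenheit.mean()

print(fahrenheit)
print(roots)
print(mean_f)
```

The loop over elements still happens, but inside NumPy's compiled code rather than in the Python interpreter.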

**Pandas**
- Python library used for working with data sets
- widely used for data wrangling (including on big data), alongside many other data science modules in Python

...
- Data cleansing
- Data fill
- Data normalization
- Merges and joins, grouping
- Data visualization
- Statistical analysis
- Data inspection
- Loading and saving data
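
A few of the tasks above can be sketched on a toy table. The data, column names and the choice of mean-fill and min-max scaling are assumptions for illustration only; other fill and normalization strategies are just as valid.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing value and inconsistent text.
df = pd.DataFrame({
    "city": [" Leeds", "york ", "Leeds", "York"],
    "sales": [10.0, np.nan, 30.0, 20.0],
})

# Data inspection: count missing values per column.
missing = df.isna().sum()

# Data cleansing: strip whitespace and standardise capitalisation.
df["city"] = df["city"].str.strip().str.title()

# Data fill: replace missing sales with the column mean.
df["sales"] = df["sales"].fillna(df["sales"].mean())

# Data normalization: min-max scale sales into the range [0, 1].
s = df["sales"]
df["sales_scaled"] = (s - s.min()) / (s.max() - s.min())

# Grouping: total sales per city.
totals = df.groupby("city")["sales"].sum()

print(missing)
print(df)
print(totals)
```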

...
- Easy handling of missing data (represented as NaN) in both floating point and non-floating point data
- Size mutability: columns can be inserted and deleted from DataFrames and higher-dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data in computations
- Powerful, flexible group-by functionality to perform split-apply-combine operations on data sets for both aggregating and transforming data
- Making it easy to convert ragged, differently indexed data in other Python and NumPy data structures into DataFrame objects
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining of data sets
- Flexible reshaping and pivoting of data sets
- Hierarchical labeling of axes (possible to have multiple labels per tick)
- Robust I/O tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting, and lagging
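
Three of the features listed above (automatic alignment, merging, and time series tools) can be sketched in a few lines. All data and names here are hypothetical.

```python
import pandas as pd

# Automatic alignment: values are matched on index labels, not position.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
total = s1 + s2  # labels present in only one Series give NaN

# Merging: join two tables on a shared key column.
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [100, 101, 100]})
customers = pd.DataFrame({"customer_id": [100, 101], "name": ["Ada", "Grace"]})
merged = orders.merge(customers, on="customer_id", how="left")

# Time series: date range generation and a moving-window mean.
dates = pd.date_range("2024-01-01", periods=5, freq="D")
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=dates)
rolling_mean = ts.rolling(window=2).mean()

print(total)
print(merged)
print(rolling_mean)
```

Note that `total` has NaN for labels `a` and `d`, which ties back to the missing-data handling in the first bullet.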


**Visualisations**
- Matplotlib, Seaborn
- Plotly, Bokeh, Dash - interactive visualisations
- geopandas, geoplot - geospatial visualisations
- usual diagrams; wordclouds


https://towardsdatascience.com/introduction-to-data-visualization-in-python-89a54c97fbed
https://www.datacamp.com/tutorial/wordcloud-python
https://towardsdatascience.com/visualizing-geospatial-data-in-python-e070374fe621

**And so on**
- Scikit-Learn
- SciPy
- PyTorch
- Hugging Face
- TensorFlow
- Keras
- NLTK


## TODO
- display image with PIL; change its colours
- plot something simple, boxplot, violin plot, joint plot with regression
- static type checker

In [None]:
# vectorisation examples - timeit

In [25]:
import numpy as np

a = b = range(1,6000)



In [6]:
%%timeit

c = []
for i in range(len(a)):
    c.append(a[i] * b[i])
c

1.91 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [9]:
%%timeit

d = np.array(a)*np.array(b)
d

738 µs ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [14]:
%%timeit

np.array(a)*2

374 µs ± 9.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
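
The NumPy timings above include the cost of converting each `range` to an array inside the timed cell. The sketch below separates conversion from multiplication using the standard-library `timeit` module, so it also runs outside a notebook. The function names and the repeat count are assumptions for illustration, and actual timings will vary by machine.

```python
import timeit

import numpy as np

a = b = range(1, 6000)

# Convert once, outside the timed code, so only the multiplication is measured.
a_arr = np.array(a)
b_arr = np.array(b)


def loop_multiply():
    """Pure Python: multiply element by element."""
    return [x * y for x, y in zip(a, b)]


def convert_and_multiply():
    """NumPy, but paying for the range-to-array conversion on every call."""
    return np.array(a) * np.array(b)


def multiply_only():
    """NumPy on arrays that already exist."""
    return a_arr * b_arr


runs = 200  # arbitrary repeat count for this sketch
for func in (loop_multiply, convert_and_multiply, multiply_only):
    seconds = timeit.timeit(func, number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.1f} µs per call")
```

If the data already lives in arrays, `multiply_only` would be expected to be much faster than either of the other two, since building an array from a `range` is itself a Python-level pass over the data.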


Some nice timing comparisons:
https://medium.com/pythoneers/vectorization-in-python-an-alternative-to-python-loops-2728d6d7cd3e

https://www.programiz.com/python-programming/numpy/vectorization#:~:text=Let%27s%20take%20a%20simple%20example,each%20element%20of%20the%20array.&text=Here%2C%20the%20number%2010%20adds,is%20possible%20because%20of%20vectorization.

# Exercises

### Numpy
https://www.machinelearningplus.com/python/101-numpy-exercises-python/

https://www.w3schools.com/python/numpy/exercise.asp

### Pandas
https://www.machinelearningplus.com/python/101-pandas-exercises-python/

https://www.w3schools.com/python/pandas/exercise.asp

https://github.com/guipsamora/pandas_exercises/tree/master


### Some answers
https://www.kaggle.com/code/icarofreire/pandas-24-useful-exercises-with-solutions

In [19]:
import numpy as np
a = np.array([1,2,3])
print("Repeat {}".format(np.repeat(a,3)))
print("Tile {}".format(np.tile(a,3)))
np.concatenate([np.repeat(a,3),np.tile(a,3)])

Repeat [1 1 1 2 2 2 3 3 3]
Tile [1 2 3 1 2 3 1 2 3]


array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3])

# Tutorials and docs

There are lots of resources, depending on your learning style. Some examples:

## package documentation
https://numpy.org/doc/stable/user/index.html

https://pandas.pydata.org/pandas-docs/stable/reference/index.html

## w3schools
https://www.w3schools.com/python/numpy/default.asp

https://www.w3schools.com/python/pandas/default.asp

## python
- https://docs.python.org/3/tutorial/index.html
- or Pluralsight
- https://realpython.com/


## markdown
1. https://www.ibm.com/docs/en/watson-studio-local/1.2.3?topic=notebooks-markdown-jupyter-cheatsheet
1. <div class="alert alert-block alert-info">
<b>Just for fun</b> https://www.kaggle.com/code/cuecacuela/the-ultimate-markdown-cheat-sheet</div>

In [21]:
import pandas as pd

# Load the tab-separated Chipotle orders data straight from GitHub
chipo_df = pd.read_csv("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv", sep="\t")
chipo_df

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
...,...,...,...,...,...
4617,1833,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Sour ...",$11.75
4618,1833,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...",$11.75
4619,1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Pinto...",$11.25
4620,1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Lettu...",$8.75
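
In the output above, `item_price` is stored as text with a leading `$`, so it needs converting before any arithmetic. A minimal sketch of that step follows, using the first five rows copied from the output so that it runs without a network connection. The names `sample`, `price` and `order_totals` are hypothetical, and treating `item_price` as the total for the line (rather than a unit price) is an assumption.

```python
import pandas as pd

# The first five rows copied from the output above, so the sketch runs offline.
sample = pd.DataFrame({
    "order_id": [1, 1, 1, 1, 2],
    "quantity": [1, 1, 1, 1, 2],
    "item_name": [
        "Chips and Fresh Tomato Salsa",
        "Izze",
        "Nantucket Nectar",
        "Chips and Tomatillo-Green Chili Salsa",
        "Chicken Bowl",
    ],
    "item_price": ["$2.39", "$3.39", "$3.39", "$2.39", "$16.98"],
})

# item_price is text; strip the dollar sign and convert to float.
sample["price"] = (
    sample["item_price"].str.replace("$", "", regex=False).astype(float)
)

# Total per order, assuming item_price is already the line total.
order_totals = sample.groupby("order_id")["price"].sum()

print(sample.dtypes)
print(order_totals)
```

The same two lines should apply to the full `chipo_df`, provided every `item_price` value follows the `$x.xx` pattern seen here.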
