<img src="img/ub-iup-logo_640.jpg" width=400 />



# IUP Python Tutorial


# What is Python and how can I work with it?

Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.

https://en.wikipedia.org/wiki/Python_(programming_language)

You can work with Python in several ways:
 - In any IDE, e.g. PyCharm or Visual Studio Code
 - In a Jupyter notebook
 - From the terminal using IPython
 - By putting your code in a text file (usually with a .py extension)

```bash
$ jupyter notebook
$ ipython
```

# Basic overview

In [99]:
# comment
# Variable definition
my_int = 2
my_string = 'hello world'
my_real = 2.0

print(my_int)

2


In [100]:
print(my_string)

hello world


In [101]:
# built-in functions and operators
# e.g. arithmetic operations

50 + 10

60

In [102]:
# and some abstract data types, like dictionaries:
my_catalog = {}
my_catalog['apples'] = 10
my_catalog['lemon'] = 10

In [103]:
# keys
print(my_catalog.keys())

dict_keys(['apples', 'lemon'])


In [104]:
# Values
print(my_catalog.values())

dict_values([10, 10])


In [105]:
print(my_catalog['apples'])

10
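
A dictionary can also be traversed as key/value pairs. The following is a minimal sketch that reuses the `my_catalog` example; the `'pears'` key and the `missing` and `total` variables are added here for illustration only.

```python
# Rebuild the catalog from the overview above
my_catalog = {'apples': 10, 'lemon': 10}

# items() yields (key, value) pairs in insertion order
for fruit, amount in my_catalog.items():
    print(fruit, amount)

# get() returns a default instead of raising KeyError for a missing key
missing = my_catalog.get('pears', 0)

# sum the stock over all entries
total = sum(my_catalog.values())
print(missing, total)
```

Using `get()` with a default is a common way to avoid a `KeyError` when a key may be absent.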


# Data types
Mutable or immutable? Full article (5-minute read):
https://medium.com/@meghamohan/mutable-and-immutable-side-of-python-c2145cf72747

Everything in Python is an object: a mutable object can be changed after it is created, and an immutable object can't.

<img src="img/1uFlTNY4W3czywyU18zxl8w.png" width=400 />



## Practical example

In [106]:
x = 10
y = x

In [107]:
# We are creating an object of type int. The identifiers x and y point to the same object.
id(x)

9785152

In [108]:
id(y)

9785152

In [109]:
# if we do a simple operation.
x = x + 1

In [110]:
# Now x points to a different object
id(x) 

9785184

In [111]:
id(y)

9785152

In [112]:
id(10)

9785152

In [113]:
# The object that x refers to has changed, but the object 10 was never modified. Immutable objects do not allow modification after creation.
# In the case of mutable objects:
m = list([1, 2, 3])
n = m

In [114]:
# We are creating an object of type list. The identifiers m and n refer to the same list object, which is a collection of 3 immutable int objects.
id(m)

139970940043456

In [115]:
id(n)

139970940043456

In [116]:
# Now popping an item from the list does change the object,
m.pop()

3

In [117]:
# The object id does not change
id(m) 

139970940043456

In [118]:
id(n)

139970940043456
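
Because two names can refer to the same mutable object, a change made through one name is visible through the other. The sketch below contrasts an alias with a shallow copy; the names `alias` and `clone` are chosen here for illustration and do not appear in the original cells.

```python
m = [1, 2, 3]
alias = m          # second name for the same list object
clone = m.copy()   # new list object holding the same items (shallow copy)

m.pop()            # mutates the object shared by m and alias

print(alias)       # [1, 2]
print(clone)       # [1, 2, 3]
print(alias is m, clone is m)  # True False
```

If you need an independent list, copy it explicitly; plain assignment never copies.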

In [119]:
# User-defined functions

In [120]:
def my_operation(number):
    return number > 10

In [121]:
my_operation(1)

False

In [122]:
my_operation(100)

True
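
Functions can also take default arguments and carry a docstring. This is a small sketch that generalizes `my_operation`; the name `is_above` and the `threshold` parameter are hypothetical, introduced only for this example.

```python
def is_above(number, threshold=10):
    """Return True when number is strictly greater than threshold."""
    return number > threshold


print(is_above(1))               # False
print(is_above(100))             # True
print(is_above(5, threshold=3))  # True
```

Calling `is_above(5, threshold=3)` overrides the default, while `is_above(1)` behaves like `my_operation(1)`.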

## How to use modules 

You can define your own modules to group functions, and you can install and use modules written by others.

In [123]:
import datetime

In [124]:
my_datetime = datetime.date(2020,11,10)

In [125]:
my_datetime.day

10

In [126]:
my_datetime.month

11

In [127]:
# getting help on a package (the trailing ? works in IPython and Jupyter)
datetime.timedelta?
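
As a rough illustration of what `datetime.timedelta` is for, the sketch below does some date arithmetic on the same date as above. The variables `one_week_later` and `gap` and the second date are made up for this example.

```python
import datetime

my_datetime = datetime.date(2020, 11, 10)

# timedelta represents a duration; adding it to a date gives a new date
one_week_later = my_datetime + datetime.timedelta(days=7)
print(one_week_later)   # 2020-11-17

# subtracting two dates gives a timedelta
gap = datetime.date(2020, 12, 25) - my_datetime
print(gap.days)         # 45

# help() works in any Python interpreter, not only in IPython/Jupyter
# help(datetime.timedelta)
```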

## Virtual environments

A virtual environment lets you work with an isolated set of package dependencies.
- Pros: isolated environment
- Cons: libraries are duplicated on disk, so they take extra space

I'll create a virtual environment called `iup` (in bash terminal):


```bash
# Install virtualenv on Ubuntu
sudo apt install virtualenv -y
...

# Create a new virtual environment called iup:
$ virtualenv iup
created virtual environment CPython3.8.5.final.0-64 in 486ms
...  
```


```bash
# activate virtual env
$ source iup/bin/activate
(iup) $

# deactivate virtual env
(iup) $ deactivate
$

```
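
From inside Python you can check whether the interpreter is running in a virtual environment. The helper below is a best-effort sketch, not an official API: the function name `in_virtualenv` is hypothetical, and it assumes the usual conventions (`sys.real_prefix` in older virtualenv releases, `sys.base_prefix` differing from `sys.prefix` otherwise).

```python
import sys


def in_virtualenv():
    """Best-effort check: True when running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix
    if hasattr(sys, 'real_prefix'):
        return True
    # venv and recent virtualenv make sys.prefix differ from sys.base_prefix
    return sys.prefix != getattr(sys, 'base_prefix', sys.prefix)


print(in_virtualenv())
print(sys.prefix)  # points inside the iup directory when iup is active
```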

## PIP
The Python Package Installer

https://github.com/pypa/pip

```bash
pip install jupyter
```

Now `jupyter` is available in the `iup` virtual environment.
```bash
jupyter notebook
```

<img src="img/IPy_header.png" width=300 />

# IPython


IPython provides a rich architecture for interactive computing.

https://ipython.org/

```bash
ipython
```

<img src="img/1920px-Pandas_logo.svg.png" width=200 />

# Pandas

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language. 

https://pandas.pydata.org/

I'll follow this tutorial:
https://nbviewer.jupyter.org/github/justmarkham/pycon-2018-tutorial/blob/master/tutorial.ipynb

We need to import Pandas, and use an alias: `pd`

```python
import pandas as pd
pd.__version__
```

```python
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-53-e4680618b426> in <module>
----> 1 import pandas as pd
      2 pd.__version__
      3 

ModuleNotFoundError: No module named 'pandas'
```

Oops, it is not installed. Let's install it:

```
(iup) $ pip install pandas
```
and matplotlib:
```
(iup) $ pip install matplotlib
```
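
To avoid hitting a `ModuleNotFoundError` in the middle of a session, you can check up front whether a module is importable. A minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` is an assumption made for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


for name in ('pandas', 'matplotlib'):
    if is_installed(name):
        print(name, '-> installed')
    else:
        print(name, '-> missing, run: pip install ' + name)
```

The check only looks at the environment of the running interpreter, so run it with the `iup` environment activated.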

In [128]:
import pandas as pd
pd.__version__

'1.1.4'

In [129]:
import matplotlib.pyplot as plt

In [130]:
#only if you use jupyter notebook
%matplotlib inline 

In [131]:
# ri stands for Rhode Island
ri = pd.read_csv('ri_statewide_2020_04_01.csv', low_memory=False)

In [132]:
ri.head()  # head() is a method of the ri object

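| | raw_row_number | date | time | zone | subject_race | subject_sex | department_id | type | arrest_made | citation_issued | ... | reason_for_stop | vehicle_make | vehicle_model | raw_BasisForStop | raw_OperatorRace | raw_OperatorSex | raw_ResultOfStop | raw_SearchResultOne | raw_SearchResultTwo | raw_SearchResultThree |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2005-11-22 | 11:15:00 | X3 | white | male | 200 | vehicular | False | True | ... | Speeding | NaN | NaN | SP | W | M | M | NaN | NaN | NaN |
| 1 | 2 | 2005-10-01 | 12:20:00 | X3 | white | male | 200 | vehicular | False | True | ... | Speeding | NaN | NaN | SP | W | M | M | NaN | NaN | NaN |
| 2 | 3 | 2005-10-01 | 12:30:00 | X3 | white | female | 200 | vehicular | False | True | ... | Speeding | NaN | NaN | SP | W | F | M | NaN | NaN | NaN |
| 3 | 4 | 2005-10-01 | 12:50:00 | X3 | white | male | 200 | vehicular | False | True | ... | Speeding | NaN | NaN | SP | W | M | M | NaN | NaN | NaN |
| 4 | 5 | 2005-10-01 | 13:10:00 | X3 | white | female | 200 | vehicular | False | True | ... | Speeding | NaN | NaN | SP | W | F | M | NaN | NaN | NaN |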


In [133]:
ri.shape  # shape is an attribute of the ri object

(509681, 31)

1. What do these numbers mean?
1. What does NaN mean?
1. Why might a value be missing?
1. Why mark it as NaN? 
1. Why not mark it as a 0 or an empty string or a string saying "Unknown"?


`NaN` marks a missing value. A value may be missing because of data corruption, because it does not apply given the value in another column, etc.
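
To see why a missing value is better marked as `NaN` than as `0`, consider the plain-Python sketch below. The `speeds` readings are made up for illustration and are not taken from the dataset.

```python
import math

nan = float('nan')

# NaN is not equal to anything, including itself, so test it with isnan()
print(nan == nan)        # False
print(math.isnan(nan))   # True

# Hypothetical speed readings; the third one was never recorded
speeds = [60.0, 80.0, nan]

# Skipping the missing reading gives the mean of the known values
known = [s for s in speeds if not math.isnan(s)]
mean_known = sum(known) / len(known)        # 70.0

# Recording the gap as 0 silently drags the mean down
as_zero = [0.0 if math.isnan(s) else s for s in speeds]
mean_zero = sum(as_zero) / len(as_zero)     # about 46.67

print(mean_known, round(mean_zero, 2))
```

A `0` or an `"Unknown"` string looks like real data and can leak into calculations, whereas `NaN` can be detected and skipped explicitly.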


In [134]:
ri.dtypes

raw_row_number            int64
date                     object
time                     object
zone                     object
subject_race             object
subject_sex              object
department_id            object
type                     object
arrest_made              object
citation_issued          object
outcome                  object
contraband_found         object
contraband_drugs         object
contraband_weapons       object
contraband_alcohol       object
contraband_other         object
frisk_performed          object
search_conducted           bool
search_basis             object
reason_for_search        object
reason_for_stop          object
vehicle_make             object
vehicle_model            object
raw_BasisForStop         object
raw_OperatorRace         object
raw_OperatorSex          object
raw_ResultOfStop         object
raw_SearchResultOne      object
raw_SearchResultTwo      object
raw_SearchResultThree    object
dtype: object

In [135]:
# what are these counts? how does this work?
ri.isnull().sum()

raw_row_number                0
date                         10
time                         10
zone                         10
subject_race              29073
subject_sex               29097
department_id                10
type                          0
arrest_made               29073
citation_issued           29073
outcome                   35841
contraband_found         491919
contraband_drugs         493693
contraband_weapons       497886
contraband_alcohol       508464
contraband_other         491919
frisk_performed              10
search_conducted              0
search_basis             491919
reason_for_search        491919
reason_for_stop           29073
vehicle_make             191564
vehicle_model            279593
raw_BasisForStop          29073
raw_OperatorRace          29073
raw_OperatorSex           29073
raw_ResultOfStop          29073
raw_SearchResultOne      491919
raw_SearchResultTwo      508862
raw_SearchResultThree    509513
dtype: int64

`contraband_found`, `search_basis` and `reason_for_search` are missing in 491919 of the 509681 rows, i.e. almost every value is missing (compare with `ri.shape` above).
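
How does `isnull().sum()` produce these counts? The sketch below walks through it on a tiny, made-up DataFrame (the `toy` frame and its values are assumptions for illustration, only the column names are borrowed from the dataset).

```python
import numpy as np
import pandas as pd

# Tiny made-up frame, not taken from the real dataset
toy = pd.DataFrame({
    'subject_sex': ['male', np.nan, 'female'],
    'vehicle_make': [np.nan, np.nan, 'HOND'],
})

# isnull() returns a same-shaped frame of booleans: True where a value is missing
mask = toy.isnull()
print(mask)

# sum() adds up each column, counting True as 1 and False as 0
missing_per_column = mask.sum()
print(missing_per_column)
```

So each number in the output is simply the count of `True` values, i.e. missing entries, in that column.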

## When someone is stopped for speeding, how often is it a man or a woman?

Let's ask the DataFrame, step by step.

In [136]:
# Get a mask:
ri.reason_for_stop == 'Speeding'

0          True
1          True
2          True
3          True
4          True
          ...  
509676    False
509677     True
509678    False
509679     True
509680    False
Name: reason_for_stop, Length: 509681, dtype: bool

In [137]:
# apply the mask to the DataFrame
ri[ri.reason_for_stop == 'Speeding']

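| | raw_row_number | date | time | zone | subject_race | subject_sex | department_id | type | arrest_made | citation_issued | ... | reason_for_stop | vehicle_make | vehicle_model | raw_BasisForStop | raw_OperatorRace | raw_OperatorSex | raw_ResultOfStop | raw_SearchResultOne | raw_SearchResultTwo | raw_SearchResultThree |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2005-11-22 | 11:15:00 | X3 | white | male | 200 | vehicular | False | True | ... | Speeding | NaN | NaN | SP | W | M | M | NaN | NaN | NaN |
| 1 | 2 | 2005-10-01 | 12:20:00 | X3 | white | male | 200 | vehicular | False | True | ... | Speeding | NaN | NaN | SP | W | M | M | NaN | NaN | NaN |
| 2 | 3 | 2005-10-01 | 12:30:00 | X3 | white | female | 200 | vehicular | False | True | ... | Speeding | NaN | NaN | SP | W | F | M | NaN | NaN | NaN |
| 3 | 4 | 2005-10-01 | 12:50:00 | X3 | white | male | 200 | vehicular | False | True | ... | Speeding | NaN | NaN | SP | W | M | M | NaN | NaN | NaN |
| 4 | 5 | 2005-10-01 | 13:10:00 | X3 | white | female | 200 | vehicular | False | True | ... | Speeding | NaN | NaN | SP | W | F | M | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 509670 | 509671 | 2015-12-27 | 12:45:00 | X4 | white | male | 500 | vehicular | False | True | ... | Speeding | HOND | ACCORD | SP | W | M | M | NaN | NaN | NaN |
| 509671 | 509672 | 2015-12-27 | 13:43:00 | X4 | white | male | 500 | vehicular | False | True | ... | Speeding | ACUR | TSX | SP | W | M | M | NaN | NaN | NaN |
| 509672 | 509673 | 2015-12-28 | 02:29:00 | K2 | white | male | 900 | vehicular | False | True | ... | Speeding | TOYT | COROLL | SP | W | M | M | NaN | NaN | NaN |
| 509677 | 509678 | 2015-12-20 | 11:17:00 | K3 | white | female | 300 | vehicular | False | True | ... | Speeding | NaN | NaN | SP | W | F | M | NaN | NaN | NaN |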


See the `reason_for_stop` column: every remaining row has the value `Speeding`.

In [138]:
# get only the subject_sex column
ri[ri.reason_for_stop == 'Speeding'].subject_sex

0           male
1           male
2         female
3           male
4         female
           ...  
509670      male
509671      male
509672      male
509677    female
509679    female
Name: subject_sex, Length: 268744, dtype: object

In [139]:
# count the values
ri[ri.reason_for_stop == 'Speeding'].subject_sex.value_counts()

male      182538
female     86198
Name: subject_sex, dtype: int64

In [140]:
# normalize it
ri[ri.reason_for_stop == 'Speeding'].subject_sex.value_counts(normalize=True)

male      0.679247
female    0.320753
Name: subject_sex, dtype: float64