<a href="https://colab.research.google.com/github/mrdandelion6/python-data-analysis-suds/blob/main/Python_and_Data_Analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PYTHON AND DATA ANALYTICS



### Working with modules

Python comes with several modules built in. To use one, we import it. Normally, import statements go at the very top of the file -- we're only putting them in the middle here for teaching purposes.

In [None]:
import math


When using code in a module, we first reference the module, followed by a period, then the value or function being used.



In [None]:
math.pi

3.141592653589793

Not referencing the module produces an error.



In [None]:
pi

NameError: name 'pi' is not defined

If we only want a part of a module, we can import parts with a from...import statement. This is useful for large modules like datetime, where we may only want to use a few features.

In [None]:
from datetime import date
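
As a small sketch of what this buys us (the specific date below is just an illustrative value), the imported name can be used directly, without the `datetime.` prefix:

```python
from datetime import date

# build a date object directly; no "datetime." prefix is needed
d = date(2024, 3, 29)
print(d.year)         # 2024
print(d.isoformat())  # 2024-03-29
```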


### **NUMPY**
We can import numpy like any other module. For convenience, numpy is typically loaded as np -- an alias that makes referencing it easier.


In [None]:
import numpy as np

### `numpy` arrays
The main object in `numpy` is the `ndarray`, also referred to as the `array`. Dimensions in an array are called `axes`.

We can create an array by calling `np.array()` and passing in the data as a single value, such as a list of lists. Below, `a` is first a 3x3 matrix and is then reassigned to a 2x2 matrix, so both axes of the final array have a length of 2.

In [None]:
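# a 3x3 matrix built from a list of lists
a = np.array([[1, 4, 3],
              [3, 4, 1],
              [3, 5, 1]])

# reassigning `a` replaces the matrix above with a 2x2 one;
# the comparisons evaluate to False and True, which numpy stores as 0 and 1
a = np.array([[52, 6],
              [9 == 0, 6 == 6]])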


a.shape

(2, 2)

An `array` has an `ndim` attribute indicating the number of its axes, a `size` indicating the number of values it has, and a `shape` indicating its size in each dimension. It also has a `dtype` describing the data type of all the elements in the array.



In [None]:
# number of dimensions
print(a.ndim)

# notice that the shape is rows x columns
print(a.shape)

# notice that the size is rows * columns
print(a.size)

# int64 is a numpy-provided dtype (the default integer type can vary by platform)
print(a.dtype)

2
(2, 2)
4
int64


We can create arrays with placeholder content in several ways. This is useful when we know how many elements will be in an array, but not their values, as numpy arrays have a fixed size.




In [None]:
# create a 2x3x6 array of zeros. notice the double parentheses
a = np.zeros((2, 3, 6))
a

array([[[0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.]]])

We can also create arrays by specifying a range of values through `arange()` or generating random ones through functions like `random.randint()` and `random.random()`.

In [None]:
# create a 1D array from 1 up to (but not including) 20 in steps of 2
np.arange(1, 20, 2)


array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19])

In [None]:
# create a 1D array from 0 to 1 in steps of 0.1
np.arange(0, 1, 0.1)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

In [None]:
# create a 1D array of 4 random integers from 1 to 9 (the upper bound, 10, is excluded)
np.random.randint(1, 10, (4))

# how can we make it so that the values are chosen randomly, but are the same set every time?

array([6, 8, 8, 1])

In [None]:
# seed is used to create a reproducible random example
np.random.seed(11)
# create a 1D array of 4 random integers from 1 to 9 (the upper bound, 10, is excluded)
np.random.randint(1, 10, (4))

array([1, 2, 8, 2])
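
To check that seeding really makes the draws repeatable, here is a minimal sketch that seeds twice with the same value and compares the results. The variable names `first` and `second` are only for illustration.

```python
import numpy as np

# re-seeding with the same value replays the same sequence of draws
np.random.seed(11)
first = np.random.randint(1, 10, 4)

np.random.seed(11)
second = np.random.randint(1, 10, 4)

print(np.array_equal(first, second))  # True
```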

We can repeat values and arrays to create bigger ones with `repeat()` and `tile()`.



In [None]:
# create a 1D array through repetition
np.repeat(2.5, 8)

array([2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5])

In [None]:
# create a 2D array through repetition
onedim_arr = np.array([1, 2, 3, 4, 5])
multidim_arr = np.tile(onedim_arr, (2,3))
print(multidim_arr)

# NOTE: try help(np.tile)

[[1 2 3 4 5 1 2 3 4 5 1 2 3 4 5]
 [1 2 3 4 5 1 2 3 4 5 1 2 3 4 5]]


`np.newaxis` increases the dimension of an existing array by one each time it is used in an index expression. Thus,

* a 1D array becomes a 2D array

* a 2D array becomes a 3D array

* a 3D array becomes a 4D array

* a 4D array becomes a 5D array

and so on.

In [None]:
# 1D array
arr = np.arange(4)
arr.shape


(4,)

In [None]:
# make it a row vector by inserting an axis along the first dimension
row_vec = arr[np.newaxis, :]
row_vec.shape


(1, 4)
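
The same idea gives a column vector if the new axis goes second. A minimal sketch, where the name `col_vec` is only illustrative:

```python
import numpy as np

arr = np.arange(4)

# insert the new axis along the second dimension to get a column vector
col_vec = arr[:, np.newaxis]
print(col_vec.shape)  # (4, 1)
```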

## Basic operations

`numpy` arrays allow us to perform vector operations, manipulating all the elements in an axis without writing loops. For example, we can apply a function to all of the elements in an array with one simple command.

In [None]:
# angles from 0 to 315 degrees in steps of 45, converted to radians
radians = np.arange(0, 360, 45) * np.pi / 180
radians

np.sin(radians)

# print(dir(np)) # NOTE: what else can numpy do?

array([ 0.00000000e+00,  7.07106781e-01,  1.00000000e+00,  7.07106781e-01,
        1.22464680e-16, -7.07106781e-01, -1.00000000e+00, -7.07106781e-01])

We can perform operations when arrays are the same length along the axis in use, or when values can be broadcast, or repeated, along an axis.



In [None]:
arr1 = np.array([5, 10, 15, 20])
arr2 = np.arange(5, 9)
print(arr1)
print(arr2)
arr3 = np.array([1,2])
print(arr3)

[ 5 10 15 20]
[5 6 7 8]
[1 2]


In [None]:
# incompatible shapes!
arr2+arr3

ValueError: operands could not be broadcast together with shapes (4,) (2,) 
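
For contrast, here is a small sketch of shapes that do broadcast. The `col` array is a made-up example: a scalar is repeated for every element, and a `(2, 1)` array combines with a `(4,)` array to give a `(2, 4)` result.

```python
import numpy as np

arr2 = np.arange(5, 9)        # shape (4,)
col = np.array([[1], [2]])    # shape (2, 1)

# a scalar is broadcast to every element
print(arr2 + 10)              # [15 16 17 18]

# (2, 1) and (4,) broadcast to (2, 4)
result = col + arr2
print(result.shape)           # (2, 4)
```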

We can also summarize the values in an array.



In [None]:
print(arr2)
print(np.median(arr2))
print(np.mean(arr2))
print(np.sum(arr2))
print(np.max(arr2))


[5 6 7 8]
6.5
6.5
26
8
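
On a multidimensional array, the same functions accept an `axis` argument to summarize along one axis at a time. A minimal sketch with a made-up 2x3 matrix `m`:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(np.sum(m))          # 21, over all values
print(np.sum(m, axis=0))  # [5 7 9], one sum per column
print(np.sum(m, axis=1))  # [ 6 15], one sum per row
```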


## Indexing, slicing, and iterating
We can index and slice arrays like we would a list.



In [None]:
#arr1[0]
print(arr1)
print(arr1[1:3])


[ 5 10 15 20]
[10 15]


Multidimensional arrays like matrices have one index per axis. We can pass in more than one index within the square brackets.



In [None]:
tens = np.arange(0, 120, 10).reshape(3, 4)
print(tens)
tens[0:1,1:3]

[[  0  10  20  30]
 [ 40  50  60  70]
 [ 80  90 100 110]]


array([[10, 20]])
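
A few more indexing patterns on the same array, as a short sketch: a single value, a whole row, and a whole column.

```python
import numpy as np

tens = np.arange(0, 120, 10).reshape(3, 4)

print(tens[1, 2])   # 60, a single value: row 1, column 2
print(tens[1])      # [40 50 60 70], the whole second row
print(tens[:, 0])   # [ 0 40 80], the whole first column
```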

## **PANDAS**

In [None]:
import pandas as pd

### Data frames

We can create a DataFrame manually with the `DataFrame()` constructor. If a dictionary is passed to `DataFrame()`, the keys become column names, and the values become the data in each column. Calling just `DataFrame()` creates an empty DataFrame to which data can be added later.



In [None]:
trees = pd.DataFrame({
    'name': ['sugar maple', 'black oak', 'white ash', 'douglas fir'],
    'avg_lifespan': [300, 100, 260, 450],
    'quantity': [53, 207, 178, 93]
})
trees

          name  avg_lifespan  quantity
0  sugar maple           300        53
1    black oak           100       207
2    white ash           260       178
3  douglas fir           450        93
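
As a sketch of the empty-DataFrame route, columns can be added one at a time after construction. The name `trees_later` is only illustrative.

```python
import pandas as pd

# start empty, then add one column at a time
trees_later = pd.DataFrame()
trees_later['name'] = ['sugar maple', 'black oak']
trees_later['quantity'] = [53, 207]

print(trees_later.shape)  # (2, 2)
```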


## Load data from csv

Of course, we're more likely to load data into a DataFrame than to create DataFrames manually. `pandas` has read functions for different file formats. To read data from a csv or other delimited file, we use `pd.read_csv()`, then pass in the local file path or the URL of the csv to read. `pandas` infers the data type of each column from the values it reads.
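
To see type inference without needing a file on disk, here is a minimal sketch that reads a tiny in-memory CSV. The data and the names `csv_text` and `sample` are made up for illustration; `io.StringIO` simply stands in for a file path or URL.

```python
import io
import pandas as pd

# a tiny in-memory CSV standing in for a real file (hypothetical data)
csv_text = "name,quantity,avg_height\nsugar maple,53,25.5\nblack oak,207,22.0\n"
sample = pd.read_csv(io.StringIO(csv_text))

# whole numbers are read as integers, decimals as floats
print(sample.dtypes)
```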

### **Bicycle Thefts data set**
This dataset contains Bicycle Thefts occurrences by reported date and details regarding the stolen item where available. This data includes all bicycle theft occurrences reported to the Toronto Police Service, including those where the location has not been able to be verified. As a result, coordinate fields may appear blank. Likewise, this includes occurrences where the coordinate location is outside the City of Toronto.




In [None]:
# Data: https://open.toronto.ca/dataset/bicycle-thefts/

# from google.colab import drive

# drive.mount('/content/drive')

thefts = pd.read_csv('/content/drive/MyDrive/SUDS Notebooks/content/bicycle-thefts - 4326.csv')


In [None]:
thefts

Unnamed: 0,_id,EVENT_UNIQUE_ID,PRIMARY_OFFENCE,OCC_DATE,OCC_YEAR,OCC_MONTH,OCC_DOW,OCC_DAY,OCC_DOY,OCC_HOUR,...,LOCATION_TYPE,PREMISES_TYPE,BIKE_MAKE,BIKE_MODEL,BIKE_TYPE,BIKE_SPEED,BIKE_COLOUR,BIKE_COST,STATUS,geometry
0,1,GO-20141263784,PROPERTY - FOUND,,2014,January,Wednesday,1,1,18,...,"Single Home, House (Attach Garage, Cottage, Mo...",House,TREK,SOHO S,RG,1.0,BLK,,RECOVERED,"{'type': 'MultiPoint', 'coordinates': [[-79.41..."
1,2,GO-20141261431,THEFT UNDER,,2014,January,Wednesday,1,1,7,...,"Apartment (Rooming House, Condo)",Apartment,SUPERCYCLE,,MT,10.0,,,STOLEN,"{'type': 'MultiPoint', 'coordinates': [[-79.44..."
2,3,GO-20141263544,B&E,1388-03-04,2013,December,Thursday,26,360,19,...,Other Commercial / Corporate Places (For Profi...,Commercial,FELT,F59,RC,21.0,SILRED,1300.0,STOLEN,"{'type': 'MultiPoint', 'coordinates': [[-79.39..."
3,4,GO-20141266048,THEFT UNDER,,2013,December,Monday,30,364,17,...,"Streets, Roads, Highways (Bicycle Path, Privat...",Outside,KHS,VITAMIN A,OT,24.0,WHI,500.0,STOLEN,"{'type': 'MultiPoint', 'coordinates': [[-79.42..."
4,5,GO-20149000090,THEFT UNDER,,2014,January,Wednesday,1,1,12,...,"Apartment (Rooming House, Condo)",Apartment,GI,TCX2 (2010),OT,9.0,BLU,1019.0,STOLEN,"{'type': 'MultiPoint', 'coordinates': [[-79.39..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35320,35321,GO-2024676645,PROPERTY - FOUND,,2024,March,Friday,29,89,16,...,"Streets, Roads, Highways (Bicycle Path, Privat...",Outside,CCM,,RG,0.0,BLU,50.0,UNKNOWN,"{'type': 'MultiPoint', 'coordinates': [[-79.41..."
35321,35322,GO-20249016258,THEFT UNDER - BICYCLE,,2024,March,Friday,29,89,15,...,"Apartment (Rooming House, Condo)",Apartment,GI,GIANT ESCAPE 2,RG,,GRY,900.0,STOLEN,"{'type': 'MultiPoint', 'coordinates': [[-79.40..."
35322,35323,GO-20249016319,THEFT UNDER - BICYCLE,,2024,March,Saturday,30,90,1,...,"Apartment (Rooming House, Condo)",Apartment,UK,SPECIALIZED ALL,RC,16.0,BLK,1199.0,STOLEN,"{'type': 'MultiPoint', 'coordinates': [[-79.44..."
35323,35324,GO-2024681500,THEFT OVER - BICYCLE,1710-09-07,2024,March,Wednesday,20,80,10,...,"Streets, Roads, Highways (Bicycle Path, Privat...",Outside,SANTA CRUZ,MEGA TOWER,MT,1.0,GRY,8000.0,STOLEN,"{'type': 'MultiPoint', 'coordinates': [[-79.38..."


### Profiling and initial data cleaning
We have our data, but now we need to understand what's in it. We can start to understand the DataFrame by checking out its `dtypes` and `shape` attributes, which give column data types and row by column dimensions, respectively. Note that `object` is usually `pandas`' way of saying values are stored as strings (it can also hold mixed Python objects).



In [None]:
thefts.shape

(35325, 28)

In [None]:
thefts.dtypes

_id                  int64
EVENT_UNIQUE_ID     object
PRIMARY_OFFENCE     object
OCC_DATE            object
OCC_YEAR             int64
OCC_MONTH           object
OCC_DOW             object
OCC_DAY              int64
OCC_DOY              int64
OCC_HOUR             int64
REPORT_DATE         object
REPORT_YEAR          int64
REPORT_MONTH        object
REPORT_DOW          object
REPORT_DAY           int64
REPORT_DOY           int64
REPORT_HOUR          int64
DIVISION            object
LOCATION_TYPE       object
PREMISES_TYPE       object
BIKE_MAKE           object
BIKE_MODEL          object
BIKE_TYPE           object
BIKE_SPEED         float64
BIKE_COLOUR         object
BIKE_COST          float64
STATUS              object
geometry            object
dtype: object

In [None]:
thefts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35325 entries, 0 to 35324
Data columns (total 28 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   _id              35325 non-null  int64  
 1   EVENT_UNIQUE_ID  35325 non-null  object 
 2   PRIMARY_OFFENCE  35325 non-null  object 
 3   OCC_DATE         5888 non-null   object 
 4   OCC_YEAR         35325 non-null  int64  
 5   OCC_MONTH        35325 non-null  object 
 6   OCC_DOW          35325 non-null  object 
 7   OCC_DAY          35325 non-null  int64  
 8   OCC_DOY          35325 non-null  int64  
 9   OCC_HOUR         35325 non-null  int64  
 10  REPORT_DATE      5923 non-null   object 
 11  REPORT_YEAR      35325 non-null  int64  
 12  REPORT_MONTH     35325 non-null  object 
 13  REPORT_DOW       35325 non-null  object 
 14  REPORT_DAY       35325 non-null  int64  
 15  REPORT_DOY       35325 non-null  int64  
 16  REPORT_HOUR      35325 non-null  int64  
 17  DIVISION    

## Profiling columns
It can be useful to focus on a subset of columns, particularly to understand value sets.


In [None]:
thefts['STATUS']

0        RECOVERED
1           STOLEN
2           STOLEN
3           STOLEN
4           STOLEN
           ...    
35320      UNKNOWN
35321       STOLEN
35322       STOLEN
35323       STOLEN
35324       STOLEN
Name: STATUS, Length: 35325, dtype: object


To get unique values, we can use the `unique()` Series method. If we want to count how many times each value appears, we can use the `value_counts()` method.

In [None]:
thefts['STATUS'].unique()

array(['RECOVERED', 'STOLEN', 'UNKNOWN'], dtype=object)

In [None]:
thefts['STATUS'].value_counts()

STATUS
STOLEN       34359
UNKNOWN        593
RECOVERED      373
Name: count, dtype: int64

We can summarize numeric Series much like we did with `numpy` functions.



In [None]:
thefts['BIKE_COST'].mean()

999.4065437648059

## Changing data types:
Sometimes we would like to change the type of one or more variables.
For example, to convert a column to datetime, we use the `pd.to_datetime()` function, passing in the column to convert, and reassign the output back to the column we're converting.

`pandas` knows how to convert the dates in the bike thefts data, but for less common formats, it is necessary to use the `format` keyword argument to specify how dates should be parsed. Format strings use strftime codes. See https://strftime.org/ for a cheat sheet.

In [None]:
thefts['OCC_DATE'] = pd.to_datetime(thefts['OCC_DATE'], format='%Y-%m-%d', errors='coerce')
thefts['OCC_DATE']

2              NaT
3              NaT
4              NaT
5              NaT
6              NaT
           ...    
35320          NaT
35321          NaT
35322          NaT
35323   1710-09-07
35324   1710-09-07
Name: OCC_DATE, Length: 32926, dtype: datetime64[ns]
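
As a sketch of the `format` argument with a less common layout, the made-up Series below stores dates as day/month/year. One value is deliberately unparseable to show what `errors='coerce'` does.

```python
import pandas as pd

# hypothetical day/month/year strings, plus one value that cannot be parsed
raw = pd.Series(['29/03/2024', '01/12/2023', 'not a date'])

# errors='coerce' turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')
print(parsed)
```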

Other data type conversions can be done with the `astype()` method. If we are converting to a number, `pd.to_numeric()` provides an easy way to convert without having to pick a specific numeric data type. We can also convert variables to the category data type.

In [None]:
thefts['STATUS'] = thefts['STATUS'].astype('category')
thefts['STATUS']

0        RECOVERED
1           STOLEN
2           STOLEN
3           STOLEN
4           STOLEN
           ...    
35320      UNKNOWN
35321       STOLEN
35322       STOLEN
35323       STOLEN
35324       STOLEN
Name: STATUS, Length: 35325, dtype: category
Categories (3, object): ['RECOVERED', 'STOLEN', 'UNKNOWN']
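
A minimal sketch of `pd.to_numeric()` on a made-up column of strings. The `errors='coerce'` argument is used here so that a non-numeric entry becomes `NaN` rather than raising an error.

```python
import pandas as pd

# hypothetical column that arrived as strings
speeds = pd.Series(['10', '21', 'n/a'])

# errors='coerce' replaces values that are not numbers with NaN
speeds_num = pd.to_numeric(speeds, errors='coerce')
print(speeds_num.dtype)  # float64, because NaN is a float
```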

## Filtering and selecting data

To find the records with no bike cost, we can use `.loc[]`, which lets us access rows and columns with either a boolean array or row/column labels.

In this case, the boolean array is the product of the `isna()` Series method.

In [None]:
thefts.loc[thefts['BIKE_COST'].isna()]

Unnamed: 0,_id,EVENT_UNIQUE_ID,PRIMARY_OFFENCE,OCC_DATE,OCC_YEAR,OCC_MONTH,OCC_DOW,OCC_DAY,OCC_DOY,OCC_HOUR,...,LOCATION_TYPE,PREMISES_TYPE,BIKE_MAKE,BIKE_MODEL,BIKE_TYPE,BIKE_SPEED,BIKE_COLOUR,BIKE_COST,STATUS,geometry
0,1,GO-20141263784,PROPERTY - FOUND,,2014,January,Wednesday,1,1,18,...,"Single Home, House (Attach Garage, Cottage, Mo...",House,TREK,SOHO S,RG,1.0,BLK,,RECOVERED,"{'type': 'MultiPoint', 'coordinates': [[-79.41..."
1,2,GO-20141261431,THEFT UNDER,,2014,January,Wednesday,1,1,7,...,"Apartment (Rooming House, Condo)",Apartment,SUPERCYCLE,,MT,10.0,,,STOLEN,"{'type': 'MultiPoint', 'coordinates': [[-79.44..."
45,46,GO-20141427259,B&E,,2014,January,Tuesday,28,28,19,...,"Apartment (Rooming House, Condo)",Apartment,OTHER,,MT,10.0,BLK,,STOLEN,"{'type': 'MultiPoint', 'coordinates': [[-79.49..."
48,49,GO-20141432628,THEFT UNDER,,2014,January,Tuesday,28,28,15,...,Other Commercial / Corporate Places (For Profi...,Commercial,CUSTOM,,OT,0.0,BLU,,STOLEN,"{'type': 'MultiPoint', 'coordinates': [[-79.36..."
50,51,GO-20149000905,THEFT UNDER,,2014,January,Friday,31,31,11,...,Ttc Subway Station,Transit,OT,MONT STE-ANNE,MT,18.0,,,STOLEN,"{'type': 'MultiPoint', 'coordinates': [[-79.43..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35171,35172,GO-2024409976,THEFT UNDER,,2024,February,Friday,23,54,14,...,"Streets, Roads, Highways (Bicycle Path, Privat...",Outside,OTHER,,RG,1.0,BLKONG,,STOLEN,"{'type': 'MultiPoint', 'coordinates': [[-79.39..."
35182,35183,GO-2024431477,THEFT UNDER - BICYCLE,,2024,February,Saturday,24,55,20,...,"Apartment (Rooming House, Condo)",Apartment,OTHER,,OT,12.0,ONG,,STOLEN,"{'type': 'MultiPoint', 'coordinates': [[-79.38..."
35210,35211,GO-2024491772,THEFT UNDER - BICYCLE,,2024,January,Wednesday,31,31,18,...,"Apartment (Rooming House, Condo)",Apartment,VEVOR,,TR,1.0,YEL,,STOLEN,"{'type': 'MultiPoint', 'coordinates': [[-79.45..."
35218,35219,GO-2024512227,B&E,,2024,March,Monday,4,64,14,...,"Private Property Structure (Pool, Shed, Detach...",Other,TREK,FX1,RG,1.0,BLK,,STOLEN,"{'type': 'MultiPoint', 'coordinates': [[-79.47..."


Suppose that for our analysis we want to drop all rows without `BIKE_COST` information. We can do that using the `dropna()` DataFrame method, passing the column to check as a `subset`.

In [None]:
thefts=thefts.dropna(subset=['BIKE_COST'])

We can use `.loc[]` to create a DataFrame containing only certain years. To do this, we first create the range of values to keep, then pass it to the Series `isin()` method, and assign the output to a new DataFrame. To exclude those values instead, we would negate the expression with `~`.

In [None]:
# set up filter list: range(2009, 2013) covers 2009 through 2012
filter_years = range(2009, 2013)
thefts_filter = thefts.loc[thefts['OCC_YEAR'].isin(filter_years)]
thefts_filter['OCC_YEAR'].unique()


array([2012, 2011, 2010, 2009])
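
To exclude values rather than keep them, the boolean mask can be negated with `~`. A minimal sketch on a small stand-in frame (the `df` values are illustrative, not taken from the thefts data):

```python
import pandas as pd

# small stand-in for the thefts data (values are illustrative only)
df = pd.DataFrame({'OCC_YEAR': [2009, 2012, 2014, 2015]})
filter_years = range(2009, 2013)

# ~ flips the boolean mask, keeping rows whose year is NOT in the range
outside = df.loc[~df['OCC_YEAR'].isin(filter_years)]
print(outside['OCC_YEAR'].tolist())  # [2014, 2015]
```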

## Grouping
A core workflow in `pandas` is *split-apply-combine*:

* **splitting** data into groups
* **applying** a function to each group, such as calculating group sums, standardizing data, or filtering out some groups
* **combining** the results into a data structure

This workflow starts by grouping data by calling the `groupby()` method. We'll pass in a column name or list of names to group by.

In [None]:
thefts_group=thefts.groupby('OCC_DAY')

`groupby()` returns a grouped object (a `DataFrameGroupBy`) that we can use to calculate groupwise statistics. In the results, the grouping column values become the index, or row labels. **Note that this grouped object still references the original DataFrame, so mutating one affects the other.**

In [None]:
thefts_group['OCC_HOUR'].mean()

OCC_DAY
1     11.815422
2     13.588174
3     13.433759
4     13.472081
5     13.791460
6     13.733728
7     13.118095
8     13.663971
9     13.923664
10    13.774476
11    13.152995
12    13.212204
13    13.509284
14    13.750220
15    12.516156
16    13.293345
17    12.982222
18    13.327273
19    13.140320
20    13.423700
21    13.716330
22    13.514693
23    13.580258
24    13.152557
25    13.228950
26    13.197813
27    12.923440
28    13.493097
29    13.255446
30    13.355023
31    13.768916
Name: OCC_HOUR, dtype: float64
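
The split-apply-combine steps are easier to follow on a frame small enough to check by hand. A minimal sketch with made-up values:

```python
import pandas as pd

# illustrative values only
df = pd.DataFrame({
    'OCC_DAY': [1, 1, 2, 2],
    'OCC_HOUR': [10, 14, 8, 20],
})

# split by day, apply mean to each group, combine into one Series
hour_by_day = df.groupby('OCC_DAY')['OCC_HOUR'].mean()
print(hour_by_day.loc[1])  # 12.0
print(hour_by_day.loc[2])  # 14.0
```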

### `agg()` command

So far, we have applied one function at a time. The `agg()` DataFrame method lets us apply multiple functions on different columns at once.

`agg()`'s argument syntax follows this pattern:
```
DataFrame.agg(agg_colname=('column_to_aggregate', 'aggregation_function_name'),
             agg_colname2=('col_to_agg2', 'agg_func_name'))
```

Let's see an example.

In [None]:
thefts_summary=(thefts.groupby('OCC_YEAR')
                      .agg(mean_time=('REPORT_HOUR', 'mean'),counts=('REPORT_HOUR', 'count')))



We want `OCC_YEAR` to be a regular column rather than the index, so we add `reset_index()`.

In [None]:
thefts_summary=(thefts.groupby('OCC_YEAR')
                      .agg(mean_time=('REPORT_HOUR', 'mean'),counts=('REPORT_HOUR', 'count'))
                      .reset_index()) # make OCC_YEAR a regular col


thefts_summary
# thefts_summary['OCC_YEAR']

   OCC_YEAR  mean_time  counts
0      1975       15.0       1
1      1983       19.0       1
2      2009       17.0       1
3      2010       22.0       2
4      2011  14.666667       3
5      2012       15.5       2
6      2013       14.0      42
7      2014  14.423105    2744
8      2015  14.290333    3000
9      2016  14.159933    3564


In [None]:
thefts_summary2=(thefts.groupby(['OCC_YEAR', 'STATUS'])
                  .agg(thefts=('_id', 'count'))
                  .reset_index())
thefts_summary2.head()

   OCC_YEAR     STATUS  thefts
0      1975  RECOVERED       0
1      1975     STOLEN       1
2      1975    UNKNOWN       0
3      1983  RECOVERED       0
4      1983     STOLEN       1


## Reshaping data with pivot()


The next step is to make the `STATUS` values the column headers by `pivot()`ing the data. To do this, we specify the column(s) to use as the index, or row labels; the column(s) whose values we should use as column names; and which column our values come from.

When a pivot produces a multi-level column header, the extra level can be dropped with `droplevel()`; pivoting a single values column, as we do here, does not need that step. Finally, we `reset_index()`.


For example, we could try to get three new variables: the number of recovered, stolen, and unknown bikes by year.

In [None]:
thefts_grouped = (thefts_summary2
                  .pivot(index='OCC_YEAR', columns='STATUS', values='thefts')
                  .reset_index()  # make occurrence year a regular col
                  .fillna(0)  # replace missing counts with 0
                  )

thefts_grouped
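
To make the long-to-wide reshaping concrete, here is a minimal sketch on a tiny made-up table. The names `long_df` and `wide_df` and all of the counts are illustrative only.

```python
import pandas as pd

# long-format counts, one row per year/status pair (illustrative values)
long_df = pd.DataFrame({
    'OCC_YEAR': [2014, 2014, 2015, 2015],
    'STATUS': ['STOLEN', 'RECOVERED', 'STOLEN', 'RECOVERED'],
    'thefts': [10, 1, 12, 2],
})

# one row per year, one column per status
wide_df = long_df.pivot(index='OCC_YEAR', columns='STATUS', values='thefts')
print(wide_df.loc[2014, 'STOLEN'])     # 10
print(wide_df.loc[2015, 'RECOVERED'])  # 2
```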

## MATPLOTLIB

For historical reasons, when we import matplotlib, we really import `matplotlib.pyplot`. The conventional alias is `plt`.



In [None]:
# jupyter-specific "magic" command to render plots in-line
%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt


### `pyplot`-style plotting
pyplot-style plotting is convenient for quick, exploratory plots, where we don't plan on doing a lot of customization. Let's try to make some plots using the thefts data set!


In [None]:
plt.plot(thefts['OCC_YEAR'],
            thefts['BIKE_COST'])

A scatterplot is a better fit for this data, so let's use the `scatter()` function instead. We can use keyword arguments like `facecolor` and `edgecolor` to change the styling.

In [None]:
plt.scatter(thefts['OCC_YEAR'],
            thefts['BIKE_COST'],
            marker='s',  # square marker
            facecolor='#fb1',
            edgecolor='k') # black


Using the pyplot approach, the outputs of successive function calls in the same cell context are layered on.

In [None]:
plt.scatter(thefts_filter['OCC_YEAR'],
            thefts_filter['REPORT_HOUR'],
            edgecolor='k',
            label='reported time')

plt.scatter(thefts_filter['OCC_YEAR'],
            thefts_filter['OCC_HOUR'],
            edgecolor='w', # white
            label='occurrence time')

plt.legend()

## Object-oriented approach to plotting
The object-oriented approach is the preferred method of plotting with matplotlib. In this approach, we use the `subplots()` function to create plot objects, then call methods to modify them.

By default, `subplots()` returns one Figure and one Axes. We can use Python's unpacking syntax to assign the Figure and Axes to their own variables in one line.

In [None]:
fig, ax = plt.subplots()

In [None]:
reported_hour = ax.scatter(thefts_filter['OCC_YEAR'],
                           thefts_filter['REPORT_HOUR'])
occurrence_hour = ax.scatter(thefts_filter['OCC_YEAR'],
                             thefts_filter['OCC_HOUR'])
ax.xaxis.set_major_locator(mpl.ticker.MaxNLocator(integer=True)) # to plot years as int values

fig

There are ways of adding labels, titles, grids, and legends, modifying axes, and changing styles. We do not have time to cover all of this today, but I encourage you to google it and play with your data.
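
As a starting point for that exploration, here is a minimal sketch of a few common Axes methods for titles, axis labels, grids, and legends. The data values and label text are made up for illustration.

```python
import matplotlib.pyplot as plt

# illustrative values only
years = [2009, 2010, 2011, 2012]
reported = [17, 22, 15, 16]
occurred = [12, 20, 9, 14]

fig, ax = plt.subplots()
ax.scatter(years, reported, label='reported hour')
ax.scatter(years, occurred, label='occurrence hour')

ax.set_title('Theft hours by year')  # title above the plot
ax.set_xlabel('Year')                # x-axis label
ax.set_ylabel('Hour of day')         # y-axis label
ax.grid(alpha=0.3)                   # faint background grid
ax.legend()                          # uses the label= values above
```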


### Other plot types
Of course, `matplotlib` offers more than just line plots and scatterplots. Among the many kinds of plots we can make are bar plots, histograms, and boxplots. To create each one the object-oriented way, we call the appropriate Axes method, like `Axes.boxplot()`, or `Axes.barh()` for a horizontal bar plot.

In [None]:
# create a histogram
hist_fig, hist_ax = plt.subplots()
hist_ax.hist(thefts['BIKE_COST'],
             bins=range(0, 8000, 1200))
hist_ax.set_title('Bike cost')


In [None]:
# create a box plot
box_fig, box_ax = plt.subplots()
box_ax.boxplot(thefts['BIKE_COST'],
                # add labels so we know which box is which var
              labels=['COST'])
box_ax.set_title('Bike Costs')

In [None]:
# create a barplot

bar_fig, bar_ax = plt.subplots()
bar_ax.barh(thefts_filter['LOCATION_TYPE'],thefts_filter['BIKE_COST'])
bar_ax.set_axisbelow(True)
bar_ax.grid(alpha=0.3)
bar_ax.set_title('Bike price by location type')
bar_ax.set_xlabel('Bike price (2009-2012)')




### More complex plots
Let's try plotting the number of reported bike thefts each year by whether the bike was recovered or not. We'll need to wrangle the theft data a bit to get counts by year and status. Then, we'll use the data to make a `stackplot()`.



In [None]:
# review the available columns
thefts.columns


In [None]:
thefts_grouped = (thefts
                  .groupby(['OCC_YEAR', 'STATUS'])
                  .agg(thefts=('_id', 'count'))
                  .reset_index()  # make year and status regular cols
                  .pivot(index='OCC_YEAR', columns='STATUS', values='thefts')
                  .reset_index()  # ...and again
                  .fillna(0)  # replace missing counts with 0
                  )

thefts_grouped

In [None]:
stfig, stax = plt.subplots()

stax.stackplot(thefts_grouped['OCC_YEAR'], thefts_grouped['STOLEN'],
        thefts_grouped['RECOVERED'], thefts_grouped['UNKNOWN'],
       labels=['Stolen', 'Recovered', 'Unknown'])
stax.set_axisbelow(True)
stax.grid(alpha=0.3)
stax.legend(loc='upper left')
stax.set_title('Reported Bike Thefts by Recovery Status')
stax.set_ylabel('Reported Thefts')
stax.set_xlabel('Year')

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

years = range(2000, 2012)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931, 0.934, 0.936, 0.937, 0.9375, 0.9372, 0.939]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908, 0.907, 0.904, 0.901, 0.898, 0.9, 0.896, ]

sns.set_style("whitegrid")
plt.plot(years, apples, 's-b')
plt.plot(years, oranges, 'o--r')

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);
