## Exercise 1

- This exercise is part of the Microsoft Learn collection.

In [1]:
import pandas as pd
import statsmodels.formula.api as smf

# Helper library written by Microsoft for easy plotting
import graphing

import plotly.express

In [2]:
# This line is required to display plots created using plotly
import plotly.io as pio
pio.renderers.default = 'iframe'

## Boots that fit

In this scenario, you own a shop that sells harnesses for avalanche-rescue dogs, and you’ve recently expanded to also sell doggy boots. Customers all seem to pick the correct harness sizes, but they are constantly ordering doggy boots that are the wrong size. You know most customers buy harnesses and boots in the same transaction, which gives you an idea: perhaps you could approximate which doggy boots are the correct size from the harness chosen. Then you could warn customers before they make the purchase if the boots they have selected are unlikely to be the correct size.

### Creating a simple dataset (a dict for now) containing boot sizes and harness sizes in cm


In [2]:
data = {
    'boot_size' : [ 39, 38, 37, 39, 38, 35, 37, 36, 35, 40, 
                    40, 36, 38, 39, 42, 42, 36, 36, 35, 41, 
                    42, 38, 37, 35, 40, 36, 35, 39, 41, 37, 
                    35, 41, 39, 41, 42, 42, 36, 37, 37, 39,
                    42, 35, 36, 41, 41, 41, 39, 39, 35, 39
 ],
    'harness_size': [ 58, 58, 52, 58, 57, 52, 55, 53, 49, 54,
                59, 56, 53, 58, 57, 58, 56, 51, 50, 59,
                59, 59, 55, 50, 55, 52, 53, 54, 61, 56,
                55, 60, 57, 56, 61, 58, 53, 57, 57, 55,
                60, 51, 52, 56, 55, 57, 58, 57, 51, 59
                ]
}

In [3]:
dataset = pd.DataFrame(data)
dataset

| | boot_size | harness_size |
|---|---|---|
| 0 | 39 | 58 |
| 1 | 38 | 58 |
| 2 | 37 | 52 |
| 3 | 39 | 58 |
| 4 | 38 | 57 |
| 5 | 35 | 52 |
| 6 | 37 | 55 |
| 7 | 36 | 53 |
| 8 | 35 | 49 |
| 9 | 40 | 54 |


In [4]:

# declaring a formula (Saying that the boot_size is explained by harness_size)
formula = "boot_size ~ harness_size"

# creating the model
model = smf.ols(formula=formula, data=dataset)

- An OLS model with a single feature has two parameters: a slope and an offset (the intercept). Creating the model does not set them. We need to train (fit) the model on our data to find the values that let it estimate a dog's boot size from its harness size.

In [5]:


trained_model = model.fit()

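# Look the parameters up by name; integer keys such as params[1] rely on
# deprecated positional indexing of a label-indexed pandas Series
print('Model params:\n' + f'Line slope: {trained_model.params["harness_size"]}\n' +
      f'Line intercept: {trained_model.params["Intercept"]}')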

Model params:
Line slope: 0.5859254167382713
Line intercept: 5.7191098126825874
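
For a single feature, the slope and intercept that OLS finds have a closed-form solution: the slope is the covariance of the two columns divided by the variance of the feature, and the intercept follows from the two means. The sketch below is a cross-check only, not what `statsmodels` runs internally. It reuses the values from the `data` dict above, and `fit_line` is a hypothetical helper name introduced for illustration.

```python
# Closed-form simple linear regression using only the standard library.
# The two lists are copied from the `data` dict defined earlier.
boot_size = [39, 38, 37, 39, 38, 35, 37, 36, 35, 40,
             40, 36, 38, 39, 42, 42, 36, 36, 35, 41,
             42, 38, 37, 35, 40, 36, 35, 39, 41, 37,
             35, 41, 39, 41, 42, 42, 36, 37, 37, 39,
             42, 35, 36, 41, 41, 41, 39, 39, 35, 39]
harness_size = [58, 58, 52, 58, 57, 52, 55, 53, 49, 54,
                59, 56, 53, 58, 57, 58, 56, 51, 50, 59,
                59, 59, 55, 50, 55, 52, 53, 54, 61, 56,
                55, 60, 57, 56, 61, 58, 53, 57, 57, 55,
                60, 51, 52, 56, 55, 57, 58, 57, 51, 59]


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (proportional to the covariance of x and y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Sum of squared deviations of x (proportional to the variance of x)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(harness_size, boot_size)
print(f'Line slope: {slope}')
print(f'Line intercept: {intercept}')
```

The printed values should agree with the `statsmodels` parameters above to several decimal places; the last few digits may differ because of floating-point rounding.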


In [8]:
# Draw the fitted line over the data, reading the parameters by name
graphing.scatter_2D(dataset, label_x="harness_size",
                    label_y="boot_size",
                    trendline=lambda x: trained_model.params["harness_size"] * x + trained_model.params["Intercept"])

Now we can use the trained model to predict a boot size from a harness size.

In [10]:
harness_size = {'harness_size': [42.5]}
approx_boot_size = trained_model.predict(harness_size)

print(f'Estimated boot size is: {approx_boot_size[0]}')

Estimated boot size is: 30.620940024059117
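
The prediction above is just the fitted line evaluated at the given harness size, so it can be reproduced by hand from the two parameters. The sketch below does that, then shows one way the shop scenario's warning could be built on top of the estimate. The helper names and the tolerance of one boot size are assumptions for illustration; they are not part of the original exercise.

```python
# Parameters copied from the fitted model output earlier in this notebook.
SLOPE = 0.5859254167382713
INTERCEPT = 5.7191098126825874


def estimate_boot_size(harness_size):
    """Evaluate the fitted line: boot_size = slope * harness_size + intercept."""
    return SLOPE * harness_size + INTERCEPT


def boots_look_wrong(harness_size, chosen_boot_size, tolerance=1.0):
    """Hypothetical check: True if the chosen size is more than
    `tolerance` sizes away from the model's estimate (assumed threshold)."""
    return abs(chosen_boot_size - estimate_boot_size(harness_size)) > tolerance


estimate = estimate_boot_size(42.5)
print(f'Estimated boot size is: {estimate}')

# A size close to the estimate raises no warning; a distant one does.
print(boots_look_wrong(42.5, 31))
print(boots_look_wrong(42.5, 38))
```

A fixed tolerance is the simplest choice. A real shop would probably tune it against how far actual boot sizes scatter around the fitted line.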


## Performing the same analysis on a larger dataset

In [1]:
import pandas as pd
import graphing

# Required to display plots created using plotly
import plotly.io as pio
pio.renderers.default = 'iframe'


In [3]:
dataset = pd.read_csv('doggy-boot-harness.csv')
dataset.head()

| | boot_size | harness_size | sex | age_years |
|---|---|---|---|---|
| 0 | 39 | 58 | male | 12.0 |
| 1 | 38 | 58 | male | 9.6 |
| 2 | 37 | 52 | female | 8.6 |
| 3 | 39 | 58 | male | 10.2 |
| 4 | 38 | 57 | male | 7.8 |


In [17]:
# Number of rows of data we have
print(f'Number of rows: {len(dataset)}')

# Select the male and female dogs, then print each subset and its size
male_dogs = dataset[dataset.sex == 'male']
print(f'Male dogs: {male_dogs} \n Number of male dogs: {len(male_dogs)}\n')

female_dogs = dataset[dataset.sex == 'female']
print(f'Female dogs: {female_dogs} \n Number of female dogs: {len(female_dogs)}')

Number of rows: 50
Male dogs:     boot_size  harness_size   sex  age_years
0          39            58  male       12.0
1          38            58  male        9.6
3          39            58  male       10.2
4          38            57  male        7.8
8          35            49  male       13.2
10         40            59  male        3.6
11         36            56  male       10.0
13         39            58  male       12.4
14         42            57  male        7.4
15         42            58  male       12.0
16         36            56  male       11.4
19         41            59  male       11.4
21         38            59  male        9.6
23         35            50  male        6.4
28         41            61  male       10.8
29         37            56  male        6.2
31         41            60  male        4.2
33         41            56  male        5.2
34         42            61  male        5.8
35         42            58  male        6.6
37         37            57 

In [19]:
# Make a copy of the dataset that only contains dogs with 
# a boot size below size 40

smaller_paws = dataset[dataset.boot_size < 40].copy()
print(f'We have {len(smaller_paws)} dogs with a boot size below 40')

We have 34 dogs with a boot size below 40


### Plotting

In [21]:
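# plotly.express has to be imported in this notebook before it is used
import plotly.express

# Plot the filtered copy created above (the variable is smaller_paws)
plotly.express.scatter(smaller_paws, x="harness_size", y="boot_size")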