## Introduction to Data Science

#### University of Redlands - DATA 101
#### Prof: Joanna Bieri [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
#### [Class Website: data101.joannabieri.com](https://joannabieri.com/data101.html)

---------------------------------------
# Homework Day 15
---------------------------------------

GOALS:

1. Practice making plots
2. Add trendlines to plots
3. Explore Linear Regression

----------------------------------------------------------

This homework has **4 questions** and **3 exercises**.


## Help with Algorithms!

Implementing algorithms can be very difficult. I would highly suggest that you **start by recreating the code that you see in the lecture**... copy and paste it and make sure it runs. THEN try to alter that code to do the exercises.

In [18]:
import numpy as np
!conda install pandas -y
import pandas as pd

!conda install matplotlib -y
import matplotlib.pyplot as plt
!conda install plotly -y
!conda install statsmodels -y
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

!conda install -c conda-forge itables -y
from itables import show

# This stops a few warning messages from showing
pd.options.mode.chained_assignment = None 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

Channels:
 - conda-forge
 - defaults
 - anaconda
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.



In [43]:
# Load the Data
file_location = 'https://joannabieri.com/introdatascience/data/paris-paintings.csv'
DF_raw_paintings = pd.read_csv(file_location,na_filter=False)

In [44]:
show(DF_raw_paintings)

name,sale,lot,position,dealer,year,origin_author,origin_cat,school_pntg,diff_origin,logprice,price,count,subject,authorstandard,artistliving,authorstyle,author,winningbidder,winningbiddertype,endbuyer,Interm,type_intermed,Height_in,Width_in,Surface_Rect,Diam_in,Surface_Rnd,Shape,Surface,material,mat,materialCat,quantity,nfigures,engraved,original,prevcoll,othartist,paired,figures,finished,lrgfont,relig,landsALL,lands_sc,lands_elem,lands_figs,lands_ment,arch,mytho,peasant,othgenre,singlefig,portrait,still_life,discauth,history,allegory,pastorale,other


In [45]:
# Make a copy of the data that we can start working on
DF = DF_raw_paintings.copy()

# Do something about all those different NaNs
DF.replace('',np.nan,inplace=True)
DF.replace('n/a',np.nan,inplace=True)
DF.replace('NaN',np.nan,inplace=True)

**Q1** Make histograms of the height and width of all the paintings in the data set. You should be able to recreate the plots from the lecture without looking at the code.

Don't forget to change the values into floats like we did in the lecture!


In [46]:
columns=['Height_in','Width_in']
DF[columns].dtypes
# We have objects, but we need floats.
DF['Height_in']=DF['Height_in'].apply(lambda x: float(x))
DF['Width_in']=DF['Width_in'].apply(lambda x: float(x))
# Now let's make histograms
for col in columns:
    fig = px.histogram(DF,x=col)

    fig.update_layout(template="ggplot2",
                  xaxis_title=f"{col} inches",
                  yaxis_title="")

    fig.show()

**Q2** Explain in words what these plots tell you about the data.

Both histograms are strongly right-skewed: most paintings are fairly small, with the bulk of the heights and widths below 50 inches. The most common value is around 13 inches, with a count of roughly 300. Width has a wider range than height, largely because of an outlier at about 200 inches.


**Q3** Make a scatter plot of the width vs the height like the one in the lecture. You should be able to recreate the plots here without looking at the code.

In [47]:
fig = px.scatter(DF,x='Width_in', y='Height_in',color_discrete_sequence=['black'])

fig.update_layout(template="ggplot2",
                  title='Height vs Width of Paris Paintings <br><sup> Paris auctions, 1764-1780</sup>',
                  xaxis_title= "Width in inches",
                  yaxis_title="Height in inches")

fig.show()

The plot below uses Ordinary Least Squares fitting to find a reasonable line.

In [48]:
# Example Code Trendline

DF['Height_in'] = DF['Height_in'].apply(lambda x: float(x))
DF['Width_in'] = DF['Width_in'].apply(lambda x: float(x))

fig = px.scatter(DF,
                 x='Width_in',
                 y="Height_in",
                 color_discrete_sequence=['black'],
                trendline='ols',
                trendline_scope='overall',
                trendline_color_override='blue')


fig.update_layout(template="ggplot2",
                  title='Height vs Width of Paris Paintings <br><sup> Paris auctions, 1764-1780</sup>',
                  title_x=0.5,
                  xaxis_title="Width (inches)",
                  yaxis_title="Height (inches)")

fig.show()

So the line that "fits" this data based on the code we ran is

$$ H = 0.7808 W + 3.6214 $$
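
As a quick worked example of reading this equation, the sketch below plugs a width into the fitted line to get a predicted height. The helper name `predict_height` and the 20-inch width are illustrative choices, not part of the lecture code; the slope and intercept are the values quoted above.

```python
# Slope and intercept quoted above from the OLS trendline
SLOPE = 0.7808
INTERCEPT = 3.6214

def predict_height(width_in):
    """Predicted height (inches) for a given width (inches). Hypothetical helper."""
    return SLOPE * width_in + INTERCEPT

# A painting 20 inches wide (illustrative value)
print(round(predict_height(20), 4))  # prints 19.2374
```

Read this way, the slope says that each additional inch of width goes with about 0.78 inches of additional height on average.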

**Q4** Where do we think this prediction is most accurate? Where is there the most error? Explain why you think this?

The prediction is most accurate where there are the most data points, that is, near the origin, where the majority of the paintings sit. The points there are tightly clustered around the line. As height and width increase, the points become sparser and more spread out, which indicates greater uncertainty. With fewer, more scattered points, non-linear trends may not be captured and individual points deviate further from the trendline, so the largest errors occur for the largest paintings.
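
One way to make "error" concrete is the residual: the observed height minus the height the line predicts. The sketch below computes residuals for a few made-up (width, height) pairs. These points are hypothetical and are not taken from the Paris paintings data; only the slope and intercept come from the equation above.

```python
# Line quoted above: H = 0.7808 W + 3.6214
SLOPE, INTERCEPT = 0.7808, 3.6214

# Hypothetical (width, height) pairs in inches, NOT from the data set
points = [(10.0, 12.0), (15.0, 14.0), (100.0, 40.0)]

residuals = []
for width, height in points:
    predicted = SLOPE * width + INTERCEPT
    residual = height - predicted  # positive: taller than the line predicts
    residuals.append(residual)
    print(f"W={width:6.1f}  H={height:5.1f}  predicted={predicted:6.2f}  residual={residual:7.2f}")
```

A residual far from zero, like the one for the made-up 100-inch-wide painting, is the kind of large deviation described above.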

In [49]:
# Example Code Trendline with Categories

DF['landsALL'] = DF['landsALL'].apply(lambda x: str(x))

fig = px.scatter(DF,
                 x='Width_in',
                 y="Height_in",
                 color='landsALL',
                 opacity=0.2,
                 trendline='ols')

fig.update_layout(template="ggplot2",
                  title='Height vs Width of Paris Paintings <br><sup> Paris auctions, 1764-1780</sup>',
                  title_x=0.5,
                  xaxis_title="Width (inches)",
                  yaxis_title="Height (inches)")

fig.show()

In the plot above we added another variable by coloring the points by whether or not there were landscape features in the painting. With that coloring, trendline='ols' now gives us two lines, one per group. Here there is some evidence that if a painting is a landscape, then it tends to be wider than it is tall.
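
Fitting one line per color group can also be sketched outside of plotly. The example below uses `np.polyfit` on synthetic data as a stand-in for plotly's OLS trendline. The arrays and the two group rules are invented for illustration and are not the paintings data.

```python
import numpy as np

# Synthetic widths and a 0/1 group label (invented, not the paintings data)
width = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0])
group = np.array(['0', '1', '0', '1', '0', '1', '0', '1'])

# Invented rule: group '0' follows H = 1.2 W + 1, group '1' follows H = 0.6 W + 2
height = np.where(group == '0', 1.2 * width + 1.0, 0.6 * width + 2.0)

fits = {}
for g in ['0', '1']:
    mask = group == g
    # A degree-1 polynomial fit is an ordinary least squares line
    slope, intercept = np.polyfit(width[mask], height[mask], 1)
    fits[g] = (slope, intercept)
    print(f"group {g}: H = {slope:.2f} W + {intercept:.2f}")
```

Comparing the two slopes shows how quickly height grows with width in each group, which is the comparison the two trendlines make visually.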

**Exercise 1** Redo the plot above except color by some other variable that takes values of zero or one. Describe what the ols trendline tells you about the height and width of that type of painting. E.g. Does a painting being described as pastoral mean it is taller or wider than if it is not pastoral?

1. Choose a column that has just 0 and 1 as entries
2. Change those values into strings using apply and lambda
3. Do a value counts and talk about the balance in the data
4. Create a scatter plot with an ols trendline colored by your focal column
5. Describe in words what your plot is telling you.

In [50]:
DF['relig'] = DF['relig'].apply(lambda x: str(x))
DF['relig'].value_counts()


relig
0    2826
1     567
Name: count, dtype: int64

There are 2826 paintings that are not marked as religious (`relig` = 0) and 567 that are (`relig` = 1), so the data is imbalanced: only about 17% of the paintings are religious.

In [51]:
fig = px.scatter(DF,
                 x='Width_in',
                 y="Height_in",
                 color='relig',
                 opacity=0.2,
                 trendline='ols')

fig.update_layout(template="ggplot2",
                  title='Height vs Width of Paris Paintings <br><sup> Paris auctions, 1764-1780</sup>',
                  title_x=0.5,
                  xaxis_title="Width (inches)",
                  yaxis_title="Height (inches)")

fig.show()


## Install Scikit-Learn

Run the code below to install sklearn.

```{python}
    !conda install -y scikit-learn
```

In [15]:
 !conda install -y scikit-learn

Channels:
 - conda-forge
 - defaults
 - anaconda
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/anaconda3

  added / updated specs:
    - scikit-learn


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    joblib-1.4.2               |     pyhd8ed1ab_0         215 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         215 KB

The following NEW packages will be INSTALLED:

  joblib             conda-forge/noarch::joblib-1.4.2-pyhd8ed1ab_0 
  scikit-learn       conda-forge/osx-arm64::scikit-learn-1.5.2-py312h387f99c_1 
  scipy              conda-forge/osx-arm64::scipy-1.14.1-py312h20deb59_1 
  threadpoolctl      conda-forge/noarch::threadpoolctl-3.5.0-pyhc1e730c_0 



Downloading and Extracting Packages:
           

In [16]:
# New packages to import!
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

**Exercise 2** Redo the Linear Regression from the lecture to see if you can use the size ('Surface') of the painting to predict the price.

1. Create a data frame with only the price and surface columns
2. Look at the data types
3. Preprocess the data - remove NaNs and change the Surface values to floats.
4. Train a linear regression model that takes as an input X = Surface Area and gives as an output y = price

           X = DF_model['Surface'].values.reshape(-1,1)
           y = DF_model['price'].values
5. Plot a scatter plot of Price vs Surface and add in your predicted line
6. Find your slope and intercept
7. Look at the score

Interpret your results. Should you use a linear model to predict the price of a painting using the surface area?

*In the lecture you can see the scatter plot and score that I got.*


### Preprocessing the Data

Before you can build a model you need to do some cleaning and preprocessing of your data. Here are some important steps:

1. Select the variables that you want to use (columns)
2. Decide what to do about NaNs or other strange data
3. (*advanced*) Think about rescaling and standardizing
4. Create the inputs and outputs (sometimes encode)
5. (*advanced*) Test - Train split

### Train the model

1. Create the base model, in this case LinearRegression()
2. Train the model using the training data
3. Look at the results.
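
The preprocessing and training steps above can be sketched end to end on a tiny example. This is a minimal illustration on synthetic, noise-free data (invented here, following y = 3x + 2), not the paintings data. It assumes scikit-learn is installed, and the `test_size` and `random_state` values are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, noise-free data following y = 3x + 2 (invented for illustration)
X = np.arange(20, dtype=float).reshape(-1, 1)  # inputs must be 2-D: (n_samples, n_features)
y = 3.0 * X.ravel() + 2.0

# (advanced) hold out a quarter of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

LM = LinearRegression()      # 1. create the base model
LM.fit(X_train, y_train)     # 2. train the model using the training data

# 3. look at the results
print("slope:", LM.coef_[0])
print("intercept:", LM.intercept_)
print("R^2 on held-out data:", LM.score(X_test, y_test))
```

Because the synthetic data lies exactly on a line, the fitted slope and intercept recover 3 and 2 and the score is essentially 1. Real data will be noisier.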


In [78]:
DF_model = DF[['Surface', 'price']].copy()  # copy so later in-place edits do not act on a slice of DF
DF_model.dtypes


Surface    float64
price      float64
dtype: object

In [79]:
print('Number of NaNs:')
print(DF_model.isna().sum().sum())
print('----------------------')

print('Percent NaNs:')
print(DF_model.isna().sum().sum()/len(DF))
print('----------------------')
#Lets drop these
DF_model.dropna(inplace=True)
print('Number of NaNs after drop:')
print(DF_model.isna().sum().sum())
print('----------------------')

Number of NaNs:
176
----------------------
Percent NaNs:
0.051871500147362214
----------------------
Number of NaNs after drop:
0
----------------------


In [80]:
DF_model['Surface'] = DF_model['Surface'].apply(lambda x: float(x))
DF_model.dtypes

Surface    float64
price      float64
dtype: object

In [81]:
X = DF_model['Surface'].values.reshape(-1, 1) 
y = DF_model['price'].values 

LM = LinearRegression()
LM.fit(X, y)

fig=px.scatter(DF, 
                x='Surface',
                y='price', 
                color_discrete_sequence=['black'], 
                opacity=0.2)

fig.update_layout(template="ggplot2",
                  title = "Price as a function of Surface Area",
                  xaxis_title = "Surface",
                  yaxis_title = "price") 
 
# Build the predicted line in its own frame so DF_model is not overwritten
DF_line = pd.DataFrame()
DF_line['s'] = DF['Surface']
DF_line['y'] = LM.coef_[0]*DF_line['s']+LM.intercept_
DF_line = DF_line.sort_values('s')

fig.add_trace(
    px.line(DF_line, x='s', y='y', color_discrete_sequence=['blue']).data[0]
)

fig.show()

slope = LM.coef_[0]
intercept = LM.intercept_

print(f"Slope: {slope}")
print(f"Intercept: {intercept}")

score = LM.score(X, y)
print(f"R² Score: {score}")

Slope: 0.18762629558739818
Intercept: 660.1215276728364
R² Score: 0.011141075251450583


The R^2 score of 0.011 is very poor: the linear model explains only about 1% of the variation in price, so surface area alone is not a useful linear predictor of price. Any inference drawn from the trend line is therefore unreliable. Although the slope is positive, it is difficult to claim that price increases with surface area, so a linear model should not be used to predict price from surface area here.
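
For reference, the score reported by `LinearRegression.score` is the coefficient of determination, computed as `1 - SS_res / SS_tot`. The sketch below works through that formula by hand on a small invented data set. The numbers are hypothetical and only illustrate the calculation.

```python
import numpy as np

# Invented observations and predictions (not from the paintings data)
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.5, 3.5, 6.5, 7.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # variation the model leaves unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot
print(r2)
```

A score near 1 means the predictions track the data closely. A score near 0, like the 0.011 above, means the line does little better than always predicting the mean price.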

**Exercise 3** Redo the analysis for Linear Regression with more than one categorical value (from the lecture) except this time see if the school of the painting affects the overall size ('Surface') of the painting. Don't forget to drop the NaNs where we don't know the surface size and change the surface variables to floats.

1. Create a data frame with only the school and surface columns
2. Look at the data types
3. Preprocess the data - remove NaNs and change the Surface values to floats.
4. Train a linear regression model that takes as an input X = School and gives as an output y = surface area. Now you need to one hot encode the X values!

           X = DF_model['school_pntg'].values.reshape(-1,1)
           y = DF_model['Surface'].values
   
5. Look at the output. What does it mean?
        - Which school of paintings on average are largest? smallest?

*You can see the outputs of my code in the lecture*


In [110]:
DF_model = DF[['school_pntg', 'Surface']].copy()  # copy so later in-place edits do not act on a slice of DF

print("Data types before preprocessing:")
print(DF_model.dtypes)



Data types before preprocessing:
school_pntg     object
Surface        float64
dtype: object


In [111]:
print('Number of NaNs:')
print(DF_model.isna().sum().sum())
print('----------------------')

print('Percent NaNs:')
print(DF_model.isna().sum().sum()/len(DF))
print('----------------------')
#Lets drop these
DF_model.dropna(inplace=True)
print('Number of NaNs after drop:')
print(DF_model.isna().sum().sum())
print('----------------------')

Number of NaNs:
176
----------------------
Percent NaNs:
0.051871500147362214
----------------------
Number of NaNs after drop:
0
----------------------


In [112]:
DF_model['school_pntg'].value_counts()

school_pntg
D/FL    1463
F       1294
I        406
X         40
S          7
G          5
A          2
Name: count, dtype: int64

In [117]:
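# Fit the encoder first so that its categories exist before we look at them
encoder = OneHotEncoder().fit(DF_model['school_pntg'].values.reshape(-1,1))
categories = encoder.categories_[0]
categories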

array(['A', 'D/FL', 'F', 'G', 'I', 'S', 'X'], dtype=object)

In [118]:
# Look at each category and encoding
result = DF_model.groupby('school_pntg',as_index=False).first()
encoded_data = encoder.transform(result['school_pntg'].values.reshape(-1,1))



for i,e in enumerate(encoded_data.toarray()):
    print(categories[i])
    print(e)
    print('---------------------------')

A
[1. 0. 0. 0. 0. 0. 0.]
---------------------------
D/FL
[0. 1. 0. 0. 0. 0. 0.]
---------------------------
F
[0. 0. 1. 0. 0. 0. 0.]
---------------------------
G
[0. 0. 0. 1. 0. 0. 0.]
---------------------------
I
[0. 0. 0. 0. 1. 0. 0.]
---------------------------
S
[0. 0. 0. 0. 0. 1. 0.]
---------------------------
X
[0. 0. 0. 0. 0. 0. 1.]
---------------------------


In [119]:
X = DF_model['school_pntg'].values.reshape(-1,1)
y = DF_model['Surface'].values

# Now because X has lots of categories, we need to encode it:
encoder = OneHotEncoder()
X = encoder.fit_transform(X)

LM = LinearRegression()

# Train the model using the data
LM.fit(X, y)


# Look at the information we get
print(LM.coef_)
print('------------------------------')
print(LM.intercept_)

[-392.07405546 -266.70348061  122.28772343 -393.21155591  142.65081982
 1051.64022987 -264.58968116]
------------------------------
686.0740423489735


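The intercept, approximately 686.07, is the baseline surface area, and each coefficient shifts the prediction for one school up or down from that baseline. Spanish paintings (S) are the largest on average, about 1051.64 above the baseline, and German paintings (G) are the smallest, about 393.21 below it. Both estimates rest on very few paintings (7 Spanish and 5 German), so they should be treated with caution.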
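
To make the coefficients easier to read, each school's predicted surface area can be computed as the intercept plus that school's coefficient. The sketch below does this arithmetic with the numbers printed above, paired with the category order `['A', 'D/FL', 'F', 'G', 'I', 'S', 'X']` shown for `encoder.categories_`. The variable names are illustrative and not from the lecture.

```python
# Values copied from the output above
categories = ['A', 'D/FL', 'F', 'G', 'I', 'S', 'X']
coefs = [-392.07405546, -266.70348061, 122.28772343, -393.21155591,
         142.65081982, 1051.64022987, -264.58968116]
intercept = 686.0740423489735

# Predicted surface area for a school = intercept + that school's coefficient
predicted = {c: intercept + b for c, b in zip(categories, coefs)}

# Print the schools from smallest to largest predicted surface area
for school, area in sorted(predicted.items(), key=lambda kv: kv[1]):
    print(f"{school:5s} {area:10.2f}")

print("largest:", max(predicted, key=predicted.get))
print("smallest:", min(predicted, key=predicted.get))
```

The printed coefficients sum to roughly zero, so the intercept here works out to the plain average of the seven per-school predictions rather than the average over all paintings. Schools with very few paintings (S has 7, G has 5, A has 2) give less reliable estimates.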