### Least Squares

In [2]:
import numpy as np
import pandas as pd

In lecture we considered a special case of this (where the mean of $b$ is 0); here is the full story.  Suppose you are solving a least squares problem $\min_x \| Ax-b  \|$ and $x_0$ is the unique solution. The error term $\| Ax_0 - b  \|$ describes how far off you are. In isolation this error term tells you very little, but when correctly normalized it contains valuable information.  Define the sum-squared error (SSE) as

$ \text{SSE} = \| Ax_0 - b \|^2  $

and the total sum of squares (SSTO) as

$\text{SSTO} = \| b \|^2 - \frac{1}{n}b^T J b$, where $J$ is the $n \times n$ all-ones matrix `np.ones((n,n))` and `n, 1 = b.shape`

Finally define $R^2 = 1 - \frac{\text{SSE}}{\text{SSTO}}$

Remark:  If the mean of b is 0 then SSTO reduces to $\| b \|^2.$  The extra term in the general case takes into account the fact that the mean may not be 0.
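
To make the remark concrete, here is a minimal numerical check, assuming a made-up vector `b`, that the SSTO formula above equals the sum of squared deviations of `b` from its mean. The variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Made-up column vector whose mean is not 0
b = rng.normal(loc=3.0, scale=2.0, size=(n, 1))

# SSTO as defined above: ||b||^2 - (1/n) b^T np.ones((n,n)) b
ssto_formula = np.linalg.norm(b) ** 2 - (b.T @ np.ones((n, n)) @ b).item() / n

# Equivalent form: sum of squared deviations from the mean of b
ssto_centered = np.sum((b - b.mean()) ** 2)

print(np.isclose(ssto_formula, ssto_centered))
```

The second form avoids building an $n \times n$ matrix, which matters when `b` has thousands of entries.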

Although it is not obvious, we always have $0\leq R^2\leq 1$ when the model includes a constant term, as ours will (it's an exercise you could do).  Roughly speaking, though not always, the higher the $R^2$, the better our linear model predicts the outcome.  A rough interpretation of $R^2$ as a percentage: if $R^2=.73$, then the model accounts for 73% of the variance in the data. 
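
As an illustration, the sketch below computes $R^2$ for a least squares fit on synthetic data. The helper name `r_squared` and the data are assumptions made for this example, not part of the assignment; the matrix `A` includes a column of ones so that the fit has a constant term.

```python
import numpy as np

def r_squared(A, b):
    """Return (x0, R^2) for the least squares problem min ||Ax - b||."""
    x0, *_ = np.linalg.lstsq(A, b, rcond=None)
    sse = np.sum((A @ x0 - b) ** 2)        # ||A x0 - b||^2
    ssto = np.sum((b - b.mean()) ** 2)     # same value as the SSTO formula above
    return x0, 1.0 - sse / ssto

# Synthetic data: b = 1 + 2*f1 - 1*f2 + noise
rng = np.random.default_rng(1)
m = 200
features = rng.normal(size=(m, 2))
A = np.column_stack([np.ones(m), features])   # first column gives the constant term
b = 1.0 + features @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=m)

x0, r2 = r_squared(A, b)
print(x0, r2)
```

Because the noise is small relative to the signal here, the fit should recover coefficients near `[1, 2, -1]` and an $R^2$ close to 1.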

These rough interpretations are all we will need for this assignment--for more information take a course on linear regression or read a book about it, like this [one](https://books.google.com/books/about/Applied_Linear_Regression_Models.html?id=2sl_QgAACAAJ) that is pretty easy to find online.

#### Global variables

We define the global variables dfr and dfo outside of your functions but you will reference them inside of your functions. We differentiate these from *local* variables that you define inside of your functions. In general global variables are discouraged but in this case it makes sense so you don't have to import large files every time you want to access them in a function.
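
A small sketch of the pattern, using a made-up table `games` in place of the real files: a function can read a variable defined at the top level without it being passed in as an argument.

```python
import pandas as pd

# Stand-in for a large table that is loaded once at the top level (made-up data)
games = pd.DataFrame({"year": [1984, 1984, 1985], "runs": [5, 2, 7]})

def total_runs(year):
    # Reads the global `games`; no `global` statement is needed for read-only access
    return int(games.loc[games["year"] == year, "runs"].sum())

print(total_runs(1984))
```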

In [4]:
dfr = pd.read_csv('retro.csv')
dfo = pd.read_csv('offense.csv')

In [5]:
dfr.head()

Unnamed: 0.1,Unnamed: 0,year,runs,1B,2B,3B,HR,HBP,BB,IBB,SO,SB,CS,GDP,ParkID
0,0,1984,5,6,1,0,0,0,6,2,7,1,0,0,BAL11
1,1,1984,2,6,1,0,1,0,1,0,4,0,0,0,BAL11
2,2,1984,1,5,1,0,0,0,0,0,8,0,1,0,ANA01
3,3,1984,2,5,1,0,0,0,3,1,4,0,0,0,ANA01
4,4,1984,1,6,0,0,1,1,1,0,8,0,1,0,CIN08


In [6]:
dfo.head()

Unnamed: 0,Season,Name,Team,G,AB,PA,H,1B,2B,3B,...,HBP,SF,SH,GDP,SB,CS,AVG,NameASCII,PlayerId,MLBAMID
0,2020,DJ LeMahieu,NYY,50,195,216,71,49,10,2,...,2,1,0,3,3,0,0.364103,DJ LeMahieu,9874,518934
1,2023,Luis Arraez,MIA,147,574,617,203,160,30,3,...,4,3,1,18,3,2,0.353659,Luis Arraez,18568,650333
2,2020,Juan Soto,WSN,47,154,196,54,27,14,0,...,1,0,0,1,6,2,0.350649,Juan Soto,20123,665742
3,2016,DJ LeMahieu,COL,146,552,635,192,141,32,8,...,3,6,8,19,11,7,0.347826,DJ LeMahieu,9874,518934
4,2016,Daniel Murphy,WSN,142,531,582,184,107,47,5,...,8,8,0,4,5,3,0.346516,Daniel Murphy,4316,502517


In [7]:
dfr.shape, dfo.shape

((181832, 15), (1522, 26))

In [None]:
def lin_weights(year,park):  #You are calculating 'linear weights', hence the name
    """Calculates linear weights for year and park
    
       Use least squares to calculate a 1x7 array b that best 
       approximates b[0] + BB*b[1] + 1B*b[2] + 2B*b[3] + 3B*b[4] + HR*b[5] = runs
       for the given year and park.  The final entry b[6] is the R^2 value 
       (see the above cell on how to calculate R^2).
       
       Parameters
       ----------
       year : int
           year corresponds to the 'year' heading in dfr
       park : string
           park corresponds to the 'ParkID' heading in dfr
           
       Returns
       -------
       1x6 numpy array (We don't care about b[0] so don't return it)
        
    """
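    #Use the global variable dfr inside of your function
    #Keep only the rows that match both the year and the park
    games = dfr.loc[(dfr['year']==year) & (dfr['ParkID']==park)]
    cols = ['BB', '1B', '2B', '3B', 'HR']
    pass  #Placeholder: the least squares fit and R^2 are still to be written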
    

In [11]:
years = dfr.loc[dfr['year']==1999]

In [12]:
years.head()

Unnamed: 0.1,Unnamed: 0,year,runs,1B,2B,3B,HR,HBP,BB,IBB,SO,SB,CS,GDP,ParkID
63526,63526,1999,8,16,1,0,1,0,2,0,6,2,0,1,MNT01
63527,63527,1999,2,5,1,0,0,0,6,0,8,1,0,1,MNT01
63528,63528,1999,7,7,3,0,2,2,5,0,10,1,0,0,BAL12
63529,63529,1999,10,9,2,0,2,0,6,0,7,0,0,2,BAL12
63530,63530,1999,5,8,3,1,1,1,4,0,2,1,0,4,KAN06


Test!!! Many years and many parks. Do your values make sense?  You should expect R^2 values roughly in the .65-.8 range.

### Use lin_weights to make a report

Choose a ParkID and a ten-year span (Not all parks exist in every 10 year span!).  You will record the linear weight of 'BB', '1B', etc over the span of that 10 year period in that park and produce a csv file with this information. 

Create a DataFrame that has columns: year, BB,1B,2B,3B,HR and in each row lists the year and the corresponding linear weight.

Then save your DataFrame as a csv file that has the name of your park in the filename.  All of this code should go into a function.

Hardcoding is to be expected in this function

In [3]:
#This example might be helpful in making the DataFrame of your report!
#I should have put it in the video tutorial!
my_columns = ['BB','HR']
my_array = np.random.rand(5,2)
my_df = pd.DataFrame(my_array,columns=my_columns)
my_df

Unnamed: 0,BB,HR
0,0.95796,0.595174
1,0.977735,0.723081
2,0.699355,0.76947
3,0.255901,0.461769
4,0.555207,0.766835
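
Building on the example above, here is one possible way to assemble a per-year table and save it with the park name in the filename. The function `fake_weights` is a hypothetical stand-in that returns made-up numbers (the real `lin_weights` also returns an $R^2$ entry that the report does not need), `XXX01` is a placeholder park code, and the file is written to a temporary directory only to keep the sketch self-contained.

```python
import os
import tempfile
import numpy as np
import pandas as pd

def fake_weights(year):
    """Hypothetical stand-in for lin_weights: returns 5 made-up weights."""
    rng = np.random.default_rng(year)
    return rng.random(5)

park = "XXX01"                       # placeholder park code
years = list(range(2000, 2010))      # a ten-year span
cols = ["BB", "1B", "2B", "3B", "HR"]

# One row of weights per year, stacked into a 10 x 5 array
rows = np.vstack([fake_weights(y) for y in years])
report = pd.DataFrame(rows, columns=cols)
report.insert(0, "year", years)      # put the year in the first column

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, f"{park}_weights.csv")
    report.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(path)

print(reloaded.head())
```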


In [None]:
def make_park_report():
    pass

I will run your function make_park_report() and it should produce the csv report.

### Use lin_weights to assign "context free" credit to each player for how 

Choose a ball park that existed in 2023. Make it a different one than the one you chose in the previous report.  Use lin_weights to estimate how many runs each player from 2023 would score in your ball park. This can be achieved with one matrix multiplication--no loops!

Create a DataFrame that has two columns: 'Name' and 'runs'. In each row it records the name of the player and the corresponding estimated runs.

Then save your DataFrame as a csv file that has the name of your park in the filename.  All of this code should go into a function.

Hardcoding is to be expected in this function
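
The "one matrix multiplication" idea can be sketched as follows. The player rows and the weights are made up for illustration, and the constant term is ignored, in line with `lin_weights` not returning `b[0]`.

```python
import numpy as np
import pandas as pd

# Made-up season totals for three hypothetical players
players = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "BB": [50, 30, 80],
    "1B": [100, 120, 90],
    "2B": [30, 25, 35],
    "3B": [2, 5, 1],
    "HR": [20, 10, 40],
})
cols = ["BB", "1B", "2B", "3B", "HR"]
weights = np.array([0.3, 0.5, 0.8, 1.1, 1.4])   # made-up linear weights, same order as cols

stats = players[cols].to_numpy(dtype=np.float64)  # shape (3, 5)
est_runs = stats @ weights                         # shape (3,), one estimate per player

result = pd.DataFrame({"Name": players["Name"], "runs": est_runs})
print(result)
```

The order of `cols` must match the order of the entries in `weights`, otherwise each count is multiplied by the wrong weight.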

In [None]:
def make_player_report():
    #Uses the global variable dfo inside your function
    pass

### Tutorial Follow-along

In [3]:
df=pd.read_csv('retro.csv')

In [4]:
df.shape

(181832, 15)

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,year,runs,1B,2B,3B,HR,HBP,BB,IBB,SO,SB,CS,GDP,ParkID
0,0,1984,5,6,1,0,0,0,6,2,7,1,0,0,BAL11
1,1,1984,2,6,1,0,1,0,1,0,4,0,0,0,BAL11
2,2,1984,1,5,1,0,0,0,0,0,8,0,1,0,ANA01
3,3,1984,2,5,1,0,0,0,3,1,4,0,0,0,ANA01
4,4,1984,1,6,0,0,1,1,1,0,8,0,1,0,CIN08


In [8]:
# Grabs all rows whose value in column 'SO' is less than or equal to 7
df1=df.loc[df['SO']<=7]

In [10]:
df1.head()


Unnamed: 0.1,Unnamed: 0,year,runs,1B,2B,3B,HR,HBP,BB,IBB,SO,SB,CS,GDP,ParkID
0,0,1984,5,6,1,0,0,0,6,2,7,1,0,0,BAL11
1,1,1984,2,6,1,0,1,0,1,0,4,0,0,0,BAL11
3,3,1984,2,5,1,0,0,0,3,1,4,0,0,0,ANA01
5,5,1984,8,5,5,0,2,0,2,0,3,1,0,1,CIN08
6,6,1984,2,2,0,0,1,0,1,0,4,0,1,0,KAN06


In [12]:
# Grabs all rows with ParkID equal to 'CIN09' and year greater than 2021
df2=df.loc[(df['ParkID']=='CIN09') & (df['year']>2021)]

In [13]:
df2.head()

Unnamed: 0.1,Unnamed: 0,year,runs,1B,2B,3B,HR,HBP,BB,IBB,SO,SB,CS,GDP,ParkID
172250,172250,2022,10,4,3,2,2,1,5,0,8,1,3,0,CIN09
172251,172251,2022,5,1,3,0,1,0,3,0,8,0,0,2,CIN09
172280,172280,2022,7,7,2,0,4,2,4,0,10,1,0,0,CIN09
172281,172281,2022,3,3,2,0,2,2,0,0,12,1,0,0,CIN09
172516,172516,2022,4,10,0,0,0,1,6,0,7,0,0,1,CIN09


In [14]:
df2.shape

(320, 15)
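
A tiny made-up table shows how the combined condition works: each comparison produces a boolean Series, and `&` combines them row by row. The parentheses are required because `&` binds more tightly than `==` and `>` in Python.

```python
import pandas as pd

# Made-up rows for illustration
df_demo = pd.DataFrame({
    "year": [2021, 2022, 2022, 2023],
    "ParkID": ["CIN09", "CIN09", "BAL12", "CIN09"],
    "runs": [3, 10, 4, 6],
})

# True only where both conditions hold
mask = (df_demo["ParkID"] == "CIN09") & (df_demo["year"] > 2021)
subset = df_demo.loc[mask]

print(mask.tolist())
print(subset)
```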

In [15]:
cols=['SO', 'SB', 'CS']

In [16]:
# Grabs all the rows and displays only the columns listed in 'cols'
df3=df.loc[:, cols]

In [17]:
df3.head()

Unnamed: 0,SO,SB,CS
0,7,1,0
1,4,0,0
2,8,0,1
3,4,0,0
4,8,0,1


In [18]:
df3.shape

(181832, 3)

### Extracting a numpy array

In [19]:
# Converts df3 to a numpy array of floats for numerical work such as least squares
df3_array=np.array(df3, dtype=np.float64)

In [None]:
# Shows the first five rows (indices 0-4) and all columns
df3_array[0:5, :]

In [20]:
df3_array

array([[ 7.,  1.,  0.],
       [ 4.,  0.,  0.],
       [ 8.,  0.,  1.],
       ...,
       [10.,  0.,  0.],
       [ 9.,  2.,  0.],
       [ 7.,  0.,  0.]])
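
An alternative to `np.array(df3, ...)` is the DataFrame method `to_numpy`, sketched here on a small made-up table. Either approach should give the same array; which one to use is a matter of taste.

```python
import numpy as np
import pandas as pd

# Made-up table with the same column names as df3
df_demo = pd.DataFrame({"SO": [7, 4, 8], "SB": [1, 0, 0], "CS": [0, 0, 1]})

arr = df_demo.to_numpy(dtype=np.float64)   # 3 x 3 array of floats
first_two = arr[0:2, :]                    # rows with index 0 and 1, all columns

print(arr.shape, arr.dtype)
print(first_two)
```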

### Augment dataframes and export

In [22]:
# final DataFrame to augment
df3.head()

Unnamed: 0,SO,SB,CS
0,7,1,0
1,4,0,0
2,8,0,1
3,4,0,0
4,8,0,1


In [23]:
m,_=df3.shape

In [24]:
# Creates an m x 1 array with all entries 1 (inserted into df3 in the next cell)
A=np.ones((m, 1))

In [25]:
df3.insert(1, 'dumb', A)

In [26]:
df3.head()

Unnamed: 0,SO,dumb,SB,CS
0,7,1.0,1,0
1,4,1.0,0,0
2,8,1.0,0,1
3,4,1.0,0,0
4,8,1.0,0,1
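
A shorter variant, shown on a made-up table: `insert` also accepts a single scalar, which pandas repeats down the whole column, so the separate array of ones is optional. The column name `const` and position 0 are choices made for this sketch.

```python
import pandas as pd

# Made-up table for illustration
df_demo = pd.DataFrame({"SO": [7, 4, 8], "SB": [1, 0, 0]})

# Insert a column of ones at position 0; the scalar is repeated for every row
df_demo.insert(0, "const", 1.0)

print(df_demo)
```

Note that `insert` modifies the DataFrame in place and raises an error if a column with that name already exists, so re-running the cell fails unless the DataFrame is rebuilt first.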


In [28]:
# exports to a csv
df3.to_csv('df3.csv')
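
By default `to_csv` also writes the row index, which comes back as a column named `Unnamed: 0` when the file is read again; this is likely where the `Unnamed: 0` columns in the tables above come from. The sketch below, using a made-up table and in-memory buffers instead of files, compares the default with `index=False`.

```python
import io
import pandas as pd

# Made-up table for illustration
df_demo = pd.DataFrame({"SO": [7, 4], "SB": [1, 0]})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
no_index = io.StringIO()
df_demo.to_csv(no_index, index=False)
no_index.seek(0)
cols_no_index = list(pd.read_csv(no_index).columns)

print(cols_with_index)
print(cols_no_index)
```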