# Linear Regression Modelling

In [1]:
library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(corrplot)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


corrplot 0.84 loaded



In [2]:
# Data set: gapminder
library(gapminder)

In [3]:
gapminder[1:10,]

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.853
Afghanistan,Asia,1962,31.997,10267083,853.1007
Afghanistan,Asia,1967,34.02,11537966,836.1971
Afghanistan,Asia,1972,36.088,13079460,739.9811
Afghanistan,Asia,1977,38.438,14880372,786.1134
Afghanistan,Asia,1982,39.854,12881816,978.0114
Afghanistan,Asia,1987,40.822,13867957,852.3959
Afghanistan,Asia,1992,41.674,16317921,649.3414
Afghanistan,Asia,1997,41.763,22227415,635.3414


In [4]:
head(gapminder, 10)

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.853
Afghanistan,Asia,1962,31.997,10267083,853.1007
Afghanistan,Asia,1967,34.02,11537966,836.1971
Afghanistan,Asia,1972,36.088,13079460,739.9811
Afghanistan,Asia,1977,38.438,14880372,786.1134
Afghanistan,Asia,1982,39.854,12881816,978.0114
Afghanistan,Asia,1987,40.822,13867957,852.3959
Afghanistan,Asia,1992,41.674,16317921,649.3414
Afghanistan,Asia,1997,41.763,22227415,635.3414


## Theory

#### Truth:
Assume $f(x) = \beta_{0} + \beta_{1}x $ <br>
Observed $y = f(x)+\epsilon = \beta_{0} + \beta_{1}x + \epsilon $ <br>

#### Fitted:
Assume $\hat f(x) = \hat\beta_{0} + \hat\beta_{1}x $ <br>
Predicted $\hat y = \hat f(x)= \hat\beta_{0} + \hat\beta_{1}x$ <br>

The fitted model has no $\epsilon$ term because it estimates only the signal; the noise shows up as the residuals $y - \hat y$.
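
To make the truth/fitted distinction concrete, here is a minimal sketch on simulated data, using base R only. The "true" coefficients `beta0` and `beta1`, the sample size, and the noise level are arbitrary assumptions chosen for illustration; they are not taken from the gapminder data.

```r
# Simulate data from a known "truth": f(x) = beta0 + beta1 * x, plus noise.
set.seed(42)                       # arbitrary seed, for reproducibility only
beta0 <- 2                         # hypothetical true intercept
beta1 <- 3                         # hypothetical true slope
x <- runif(200, min = 0, max = 10)
epsilon <- rnorm(200, mean = 0, sd = 1)
y <- beta0 + beta1 * x + epsilon   # observed y = f(x) + epsilon

# Fit the model: lm() estimates the coefficients by least squares.
fit <- lm(y ~ x)
beta_hat <- coef(fit)              # estimated intercept and slope

# Predictions use only the estimated coefficients (no epsilon term).
y_hat <- beta_hat[1] + beta_hat[2] * x

# The noise the model cannot capture is left over as residuals.
res <- y - y_hat
```

With enough data the estimates in `beta_hat` should land close to `beta0` and `beta1`, while `res` plays the role of the unobserved $\epsilon$.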

In [5]:
# Regress population (response) on life expectancy (predictor)
life_pop.regression <- lm(pop ~ lifeExp, data = gapminder)

In [6]:
life_pop.regression


Call:
lm(formula = pop ~ lifeExp, data = gapminder)

Coefficients:
(Intercept)      lifeExp  
   -2147963       533829  


The equation generated by this linear regression is: <br>
$Y = -2147963 + 533829 \cdot X$ <br>
where $Y$ is the population (`pop`) and $X$ is the life expectancy (`lifeExp`).

In [7]:
summary(life_pop.regression)


Call:
lm(formula = pop ~ lifeExp, data = gapminder)

Residuals:
       Min         1Q     Median         3Q        Max 
 -41194356  -27908251  -19388365   -8866658 1281882368 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -2147963   12098247  -0.178  0.85910   
lifeExp       533829     198788   2.685  0.00731 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.06e+08 on 1702 degrees of freedom
Multiple R-squared:  0.004219,	Adjusted R-squared:  0.003634 
F-statistic: 7.212 on 1 and 1702 DF,  p-value: 0.007314


#### $R^{2}$ is about 0.004: life expectancy explains only about 0.4% of the variation in population, so on its own it is a very poor predictor of population.
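
For reference, $R^{2}$ can be reproduced by hand as $1 - SS_{res}/SS_{tot}$. The sketch below does this on simulated data with a deliberately weak signal; the data and the slope of 0.1 are assumptions for illustration, not the gapminder values.

```r
# Simulated data with a weak linear signal and a lot of noise.
set.seed(1)                      # arbitrary seed, for reproducibility only
x <- rnorm(500)
y <- 0.1 * x + rnorm(500)        # hypothetical weak relationship

fit <- lm(y ~ x)

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

# The same quantity as reported by summary()
r2_lm <- summary(fit)$r.squared
```

A value near zero, as in the population model above, means the residual sum of squares is almost as large as the total sum of squares.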

#### The moderndive package converts the regression summary into tidy tables

In [9]:
library(moderndive)

In [10]:
get_regression_table(life_pop.regression)

term,estimate,std_error,statistic,p_value,lower_ci,upper_ci
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
intercept,-2147962.8,12098246.6,-0.178,0.859,-25876964.9,21581039.3
lifeExp,533828.9,198787.5,2.685,0.007,143935.2,923722.6
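
The `lower_ci` and `upper_ci` columns are consistent with the usual 95% t-interval, estimate ± t-quantile × standard error. The sketch below recomputes the `lifeExp` row from the values printed in the table; treating the interval as a 95% interval on 1702 residual degrees of freedom is an assumption that happens to reproduce the table.

```r
# Values copied from the lifeExp row of the regression table.
estimate  <- 533828.9
std_error <- 198787.5
df_resid  <- 1702                 # residual degrees of freedom from summary()

# Test statistic: estimate divided by its standard error.
t_stat <- estimate / std_error

# 95% confidence interval using the t distribution.
t_crit   <- qt(0.975, df = df_resid)
lower_ci <- estimate - t_crit * std_error
upper_ci <- estimate + t_crit * std_error

c(t_stat = t_stat, lower_ci = lower_ci, upper_ci = upper_ci)
```

On the fitted object itself, base R's `confint(life_pop.regression)` should return the same kind of interval.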


In [11]:
get_regression_summaries(life_pop.regression)

r_squared,adj_r_squared,mse,rmse,sigma,statistic,p_value,df,nobs
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.004,0.004,1.121537e+16,105902621,105964825,7.212,0.007,1,1704
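
The `mse`, `rmse`, and `sigma` columns are closely related. The numbers above are consistent with `rmse` being the square root of `mse` (residual sum of squares divided by `n`), and `sigma` dividing by the residual degrees of freedom `n - 2` instead. The sketch below checks this arithmetic using only the printed values; it is a consistency check, not a statement about how moderndive computes them internally.

```r
# Values copied from the summaries table.
n    <- 1704
rmse <- 105902621

# mse is rmse squared; multiplying by n recovers the residual sum of squares.
mse <- rmse^2
sse <- mse * n

# Residual standard error: divide by the residual degrees of freedom (n - 2).
sigma <- sqrt(sse / (n - 2))

c(mse = mse, sigma = sigma)
```

The resulting `sigma` should agree with the "Residual standard error: 1.06e+08" line in the `summary()` output.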


In [13]:
library(ggmap)

In [14]:
get_local_spot <-  get_map("Mumbai", maptype = "roadmap", zoom = 10) 

ERROR: Error: Google now requires an API key.
       See ?register_google for details.


In [17]:
library(mapview)

In [18]:
starbucks <- read_csv("https://raw.githubusercontent.com/libjohn/mapping-with-R/master/data/All_Starbucks_Locations_in_the_US_-_Map.csv")

Parsed with column specification:
cols(
  .default = col_character(),
  `Store Number` = col_double(),
  `Facility ID` = col_double(),
  `Food Region` = col_double(),
  Latitude = col_double(),
  Longitude = col_double()
)

See spec(...) for full column specifications.



In [21]:
starbucksNC <- starbucks  %>% 
  filter(State == "NC")

starbucksNC[1:3,]

Brand,Store Number,Name,Ownership Type,Facility ID,Features - Products,Features - Service,Features - Stations,Food Region,Venue Type,...,Street Line 1,Street Line 2,City,State,Zip,Country,Coordinates,Latitude,Longitude,Insert Date
<chr>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<chr>,...,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>
Starbucks,72949,Farm Fresh-Elizabeth City #469,Licensed,15736,,,,9999,Unknown,...,691 S Hughes Blvd,,Elizabeth City,NC,27909-4530,US,"(36.290986, -76.25259)",36.29099,-76.25259,06/22/2012 06:31:38 PM
Starbucks,9272,Roanoke Rapids,Company Owned,14789,"Lunch, Oven-warmed Food","Starbucks Card Mobile, Wireless Hotspot",Drive-Through,9999,Unknown,...,298 Premier Blvd.,,Roanoke Rapids,NC,27870-5076,US,"(36.4324, -77.6388)",36.4324,-77.6388,06/22/2012 06:31:38 PM
Starbucks,13258,Kitty Hawk,Company Owned,10765,Oven-warmed Food,Starbucks Card Mobile,,9999,Unknown,...,5597-A North Croatan Highway,,Kitty Hawk,NC,27949-4090,US,"(36.098345, -75.724577)",36.09835,-75.72458,06/22/2012 06:31:38 PM


In [26]:
library(sf)
library(leaflet)

Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1



In [29]:
sbux_sf <- st_as_sf(starbucksNC, coords = c("Longitude", "Latitude"),  crs=4326)

In [28]:
mapview(starbucksNC, xcol = "Longitude", ycol = "Latitude", crs = 4269, grid = FALSE)

In [30]:
mapview(sbux_sf)

In [31]:
mapview(breweries)