# ISYS 622 Project #2
By Tanner Hefflefinger and Morgan Kaiser

***
### Description of Dataset

The data was drawn from a large online clickstream dataset collected in 2011 by tracking the online shopping behavior of over 100,000 unique households. This small sample includes transactions for booking hotels online.

**Here's what the columns represent:**

* **ID**: unique transaction ID
* **DOMAIN_ID**: unique ID for the web domain
* **MACHINE_ID**: unique ID for the household on which the transaction was made
* **SITE_SESSION_ID**: unique ID for the session in which the transaction was made
* **TRANS_FREQ**: total # of transactions for the household
* **DOMAIN_NAME**: the website (aka domain) name where the transaction was made
* **DIRECTP_D**: dummy variable. 1 = transaction made directly on a hotel website. 0 = transaction made on a third party travel website
* **PROD_NAME**: the product purchased by the household, e.g. hotel or packages
* **PROD_TOTPRICE**: total price paid for this transaction
* **REF_DOMAIN_NAME**: the referring website (aka domain) name through which the final purchase website was reached
* **DURATION**: total time spent at a site (in minutes)
* **PAGES_VIEWED**: total pages viewed at a site
* **HOUSEHOLD_SIZE**: total # of people in the household
* **CHILDREN_D**: dummy variable. indicates whether household has any children
* **CONNECTIONSPEED_D**: dummy variable. indicates whether household has a high speed connection
***

Create the following 2 additional variables in your data:

- REF_D
    - dummy variable indicating whether the transaction was referred from another website or the final booking website was accessed directly.
    - If no information is provided for the variable REF_DOMAIN_NAME:
        - REF_D = 0; otherwise REF_D = 1
- LOG_PRICE
    - take the log transformation of the variable PROD_TOTPRICE

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_excel('HotelClickStream.xls')
data.head()

Unnamed: 0,ID,DOMAIN_ID,MACHINE_ID,SITE_SESSION_ID,TRANS_FREQ,DOMAIN_NAME,DIRECTP_D,PROD_NAME,PROD_QTY,PROD_TOTPRICE,REF_DOMAIN_NAME,DURATION,PAGES_VIEWED,HOUSEHOLD_SIZE,CHILDREN_D,CONNECTIONSPEED_D
0,1525,13877604970862366012,85643811,4447900536932,1,ichotelsgroup.com,1,FT. LAUDERDALE AIRPORT/CRUISE - CROWNE PLAZA H...,32,2847.039993,,23.328125,13,6,1,1
1,402,7101213156062330967,76460408,71774258860245,1,orbitz.com,0,WALT DISNEY WORLD MAGIC YOUR WAY TICKETS! N/A,1,2406.939995,yahoo.com,47.109375,17,2,1,1
2,233,7772350535129410931,74286590,3825866182640,1,hyatt.com,1,HYATT REGENCY MAUI RESORT SPA FRI 11 MAR 2011...,5,2168.0,google.com,20.058594,19,1,0,1
3,2362,9530952911301729568,90015830,70000481538306,1,expedia.com,0,HOTEL - THE ADDRESS DUBAI MARINA ~SAT DEC/10/2...,5,1958.699997,,47.546875,39,1,0,1
4,2738,4024709573451844450,91435029,5158448795791,2,starwoodhotels.com,1,HOTEL-W NEW YORK - TIMES SQUARE 08/18~08/21,3,1797.0,whotels.com,14.599609,19,1,0,1


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3749 entries, 0 to 3748
Data columns (total 16 columns):
ID                   3749 non-null int64
DOMAIN_ID            3749 non-null uint64
MACHINE_ID           3749 non-null int64
SITE_SESSION_ID      3749 non-null int64
TRANS_FREQ           3749 non-null int64
DOMAIN_NAME          3749 non-null object
DIRECTP_D            3749 non-null int64
PROD_NAME            3749 non-null object
PROD_QTY             3749 non-null int64
PROD_TOTPRICE        3749 non-null float64
REF_DOMAIN_NAME      1687 non-null object
DURATION             3749 non-null float64
PAGES_VIEWED         3749 non-null int64
HOUSEHOLD_SIZE       3749 non-null int64
CHILDREN_D           3749 non-null int64
CONNECTIONSPEED_D    3749 non-null int64
dtypes: float64(2), int64(10), object(3), uint64(1)
memory usage: 468.7+ KB


REF_DOMAIN_NAME is the only column in the dataset that has nulls (only 1687 of its 3749 entries are non-null). Replace the nulls with 0 for this project.

In [4]:
data['REF_DOMAIN_NAME'] = data['REF_DOMAIN_NAME'].replace(np.nan, 0)

Check data again...

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3749 entries, 0 to 3748
Data columns (total 16 columns):
ID                   3749 non-null int64
DOMAIN_ID            3749 non-null uint64
MACHINE_ID           3749 non-null int64
SITE_SESSION_ID      3749 non-null int64
TRANS_FREQ           3749 non-null int64
DOMAIN_NAME          3749 non-null object
DIRECTP_D            3749 non-null int64
PROD_NAME            3749 non-null object
PROD_QTY             3749 non-null int64
PROD_TOTPRICE        3749 non-null float64
REF_DOMAIN_NAME      3749 non-null object
DURATION             3749 non-null float64
PAGES_VIEWED         3749 non-null int64
HOUSEHOLD_SIZE       3749 non-null int64
CHILDREN_D           3749 non-null int64
CONNECTIONSPEED_D    3749 non-null int64
dtypes: float64(2), int64(10), object(3), uint64(1)
memory usage: 468.7+ KB


#### REF_D

We are creating a new variable called REF_D from the REF_DOMAIN_NAME column.

REF_DOMAIN_NAME is of object datatype (it holds domain names as strings), so it is categorical and needs to be converted to a numerical dummy variable before it can be used in a regression.

In [6]:
def label_domain (row):
    if row['REF_DOMAIN_NAME'] == 0:
        return '0'
    else:
        return '1'

In [7]:
data['REF_D'] = data.apply (lambda row: label_domain(row), axis=1)
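
The row-wise `apply` above can also be written as a vectorized comparison. The following is a minimal sketch on a toy frame; the `toy` DataFrame and its four values are illustrative assumptions, not the project data. It shows two equivalent routes: deriving the dummy straight from missingness, and comparing against the 0 placeholder used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the REF_DOMAIN_NAME column (hypothetical values)
toy = pd.DataFrame({"REF_DOMAIN_NAME": [np.nan, "yahoo.com", "google.com", np.nan]})

# Option A: derive the dummy directly from missingness (no placeholder needed)
ref_d_from_nulls = toy["REF_DOMAIN_NAME"].notna().astype(int)

# Option B: mirror the notebook, i.e. fill nulls with 0, then flag non-zero entries
filled = toy["REF_DOMAIN_NAME"].fillna(0)
ref_d_from_zeros = (filled != 0).astype(int)

print(ref_d_from_nulls.tolist())
print(ref_d_from_zeros.tolist())
```

Both options return integers directly, so the later `astype(int)` conversion from strings would not be needed with this approach.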

In [8]:
data.head()

Unnamed: 0,ID,DOMAIN_ID,MACHINE_ID,SITE_SESSION_ID,TRANS_FREQ,DOMAIN_NAME,DIRECTP_D,PROD_NAME,PROD_QTY,PROD_TOTPRICE,REF_DOMAIN_NAME,DURATION,PAGES_VIEWED,HOUSEHOLD_SIZE,CHILDREN_D,CONNECTIONSPEED_D,REF_D
0,1525,13877604970862366012,85643811,4447900536932,1,ichotelsgroup.com,1,FT. LAUDERDALE AIRPORT/CRUISE - CROWNE PLAZA H...,32,2847.039993,0,23.328125,13,6,1,1,0
1,402,7101213156062330967,76460408,71774258860245,1,orbitz.com,0,WALT DISNEY WORLD MAGIC YOUR WAY TICKETS! N/A,1,2406.939995,yahoo.com,47.109375,17,2,1,1,1
2,233,7772350535129410931,74286590,3825866182640,1,hyatt.com,1,HYATT REGENCY MAUI RESORT SPA FRI 11 MAR 2011...,5,2168.0,google.com,20.058594,19,1,0,1,1
3,2362,9530952911301729568,90015830,70000481538306,1,expedia.com,0,HOTEL - THE ADDRESS DUBAI MARINA ~SAT DEC/10/2...,5,1958.699997,0,47.546875,39,1,0,1,0
4,2738,4024709573451844450,91435029,5158448795791,2,starwoodhotels.com,1,HOTEL-W NEW YORK - TIMES SQUARE 08/18~08/21,3,1797.0,whotels.com,14.599609,19,1,0,1,1


In [9]:
data['REF_D'] = data['REF_D'].astype(int)

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3749 entries, 0 to 3748
Data columns (total 17 columns):
ID                   3749 non-null int64
DOMAIN_ID            3749 non-null uint64
MACHINE_ID           3749 non-null int64
SITE_SESSION_ID      3749 non-null int64
TRANS_FREQ           3749 non-null int64
DOMAIN_NAME          3749 non-null object
DIRECTP_D            3749 non-null int64
PROD_NAME            3749 non-null object
PROD_QTY             3749 non-null int64
PROD_TOTPRICE        3749 non-null float64
REF_DOMAIN_NAME      3749 non-null object
DURATION             3749 non-null float64
PAGES_VIEWED         3749 non-null int64
HOUSEHOLD_SIZE       3749 non-null int64
CHILDREN_D           3749 non-null int64
CONNECTIONSPEED_D    3749 non-null int64
REF_D                3749 non-null int32
dtypes: float64(2), int32(1), int64(10), object(3), uint64(1)
memory usage: 483.3+ KB


#### LOG_PRICE

Some prices are 0, and np.log(0) evaluates to -inf, so np.log() cannot be applied directly.

Use np.log1p() instead, which computes log(1 + x) and is finite at x = 0.
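
As a small illustration of the difference, the sketch below applies both functions to a toy array. The array is an assumption made for the example; 1797.0 is one of the prices visible in the sample rows.

```python
import numpy as np

# Toy prices (hypothetical selection); the first entry is the problematic zero
prices = np.array([0.0, 1.0, 1797.0])

# np.log(0) is -inf; suppress the divide-by-zero warning for the demo
with np.errstate(divide="ignore"):
    plain_log = np.log(prices)

# log1p computes log(1 + x), so a price of 0 maps to 0
shifted_log = np.log1p(prices)

print(plain_log)
print(shifted_log)
```

Note that LOG_PRICE built this way is log(1 + price) rather than log(price), which matters little for prices in the hundreds or thousands but should be kept in mind when interpreting the coefficient.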

In [11]:
data['LOG_PRICE'] = np.log1p(data['PROD_TOTPRICE'])

In [12]:
data.head()

Unnamed: 0,ID,DOMAIN_ID,MACHINE_ID,SITE_SESSION_ID,TRANS_FREQ,DOMAIN_NAME,DIRECTP_D,PROD_NAME,PROD_QTY,PROD_TOTPRICE,REF_DOMAIN_NAME,DURATION,PAGES_VIEWED,HOUSEHOLD_SIZE,CHILDREN_D,CONNECTIONSPEED_D,REF_D,LOG_PRICE
0,1525,13877604970862366012,85643811,4447900536932,1,ichotelsgroup.com,1,FT. LAUDERDALE AIRPORT/CRUISE - CROWNE PLAZA H...,32,2847.039993,0,23.328125,13,6,1,1,0,7.954386
1,402,7101213156062330967,76460408,71774258860245,1,orbitz.com,0,WALT DISNEY WORLD MAGIC YOUR WAY TICKETS! N/A,1,2406.939995,yahoo.com,47.109375,17,2,1,1,1,7.786527
2,233,7772350535129410931,74286590,3825866182640,1,hyatt.com,1,HYATT REGENCY MAUI RESORT SPA FRI 11 MAR 2011...,5,2168.0,google.com,20.058594,19,1,0,1,1,7.682022
3,2362,9530952911301729568,90015830,70000481538306,1,expedia.com,0,HOTEL - THE ADDRESS DUBAI MARINA ~SAT DEC/10/2...,5,1958.699997,0,47.546875,39,1,0,1,0,7.580547
4,2738,4024709573451844450,91435029,5158448795791,2,starwoodhotels.com,1,HOTEL-W NEW YORK - TIMES SQUARE 08/18~08/21,3,1797.0,whotels.com,14.599609,19,1,0,1,1,7.49443


In [13]:
data.columns

Index(['ID', 'DOMAIN_ID', 'MACHINE_ID', 'SITE_SESSION_ID', 'TRANS_FREQ',
       'DOMAIN_NAME', 'DIRECTP_D', 'PROD_NAME', 'PROD_QTY', 'PROD_TOTPRICE',
       'REF_DOMAIN_NAME', 'DURATION', 'PAGES_VIEWED', 'HOUSEHOLD_SIZE',
       'CHILDREN_D', 'CONNECTIONSPEED_D', 'REF_D', 'LOG_PRICE'],
      dtype='object')

***
# Part 1 - Logit

Use Binary Outcome (Logistic/Logit) regression to answer..

**“What are the factors that influence people’s decision on whether to book directly on a hotel website or from other third party website?”**

- DV = DIRECTP_D
- IV(s) = REF_D, LOG_PRICE, TRANS_FREQ, DURATION, HOUSEHOLD_SIZE, CHILDREN_D, CONNECTIONSPEED_D

Report and interpret regression results -> include interpretation of each of the regression coefficients

**BONUS!!**

Given the regression results, your interpretation, and your experience/research on internet shopping, what kind of improvements should be made to the model?
- IVs to be removed? OR 
- new IVs to be added? OR
- Other regression methodologies?

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3749 entries, 0 to 3748
Data columns (total 18 columns):
ID                   3749 non-null int64
DOMAIN_ID            3749 non-null uint64
MACHINE_ID           3749 non-null int64
SITE_SESSION_ID      3749 non-null int64
TRANS_FREQ           3749 non-null int64
DOMAIN_NAME          3749 non-null object
DIRECTP_D            3749 non-null int64
PROD_NAME            3749 non-null object
PROD_QTY             3749 non-null int64
PROD_TOTPRICE        3749 non-null float64
REF_DOMAIN_NAME      3749 non-null object
DURATION             3749 non-null float64
PAGES_VIEWED         3749 non-null int64
HOUSEHOLD_SIZE       3749 non-null int64
CHILDREN_D           3749 non-null int64
CONNECTIONSPEED_D    3749 non-null int64
REF_D                3749 non-null int32
LOG_PRICE            3749 non-null float64
dtypes: float64(3), int32(1), int64(10), object(3), uint64(1)
memory usage: 512.6+ KB


In [16]:
import statsmodels.api as sm

In [17]:
# step 1: create x and y; add constant

x = sm.add_constant(data[['REF_D', 'LOG_PRICE', 'TRANS_FREQ','DURATION',
                          'HOUSEHOLD_SIZE', 'CHILDREN_D','CONNECTIONSPEED_D']])
y = data.DIRECTP_D



In [18]:
# step 2: build model

logit_mod=sm.Logit(y,x)

In [19]:
# step 3: fit the model

logit_res = logit_mod.fit()

Optimization terminated successfully.
         Current function value: 0.643629
         Iterations 6


In [20]:
# step 4: inspect results using summary stats

print(logit_res.summary())

                           Logit Regression Results                           
Dep. Variable:              DIRECTP_D   No. Observations:                 3749
Model:                          Logit   Df Residuals:                     3741
Method:                           MLE   Df Model:                            7
Date:                Wed, 08 Apr 2020   Pseudo R-squ.:                 0.07052
Time:                        12:29:44   Log-Likelihood:                -2413.0
converged:                       True   LL-Null:                       -2596.0
Covariance Type:            nonrobust   LLR p-value:                 4.323e-75
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -0.4457      0.407     -1.095      0.273      -1.243       0.352
REF_D                 0.7364      0.070     10.571      0.000       0.600       0.873
LOG_PRICE             0.

### Logistic Regression Observations (separated by IV)

*note: DV = log-odds of DIRECTP_D = 1, i.e. log(p / (1 - p)) where p = P(DIRECTP_D = 1)*

#### REF_D
- A one unit increase is associated with a 0.74 increase in DV
- P < 0.05 -> statistically significant
- The positive sign means that transactions referred from another website were more likely to be booked directly on a hotel website (DIRECTP_D = 1); exp(0.74) ≈ 2.1, so a referral roughly doubles the odds of a direct booking

#### LOG_PRICE
- A one unit increase is associated with a 0.0014 increase in DV
- P > 0.05 -> **NOT statistically significant**
- The coefficient is close to zero, which hints at there not being a large difference in price between third party sites and the home site

#### TRANS_FREQ
- A one unit increase is associated with a 0.12 increase in DV
- P < 0.05 -> statistically significant
- More transactions per household likely means repeat customers who have positive/negative preferences about certain sites; the positive sign suggests frequent bookers lean toward booking directly

#### DURATION
- A one unit increase is associated with a 0.019 decrease in DV
- P < 0.05 -> statistically significant
- The longer individuals spent on a site, the less likely they were to book directly; longer sessions were more often associated with third party purchases

#### HOUSEHOLD_SIZE
- A one unit increase is associated with a 0.011 decrease in DV
- P > 0.05 -> **NOT statistically significant**
- It appears that having more individuals in a household does not guarantee using either a home site or a third party site

#### CHILDREN_D
- Having children in the household (CHILDREN_D = 1 versus 0) is associated with a 0.26 decrease in DV
- P < 0.05 -> statistically significant
- While household size is not significant, having children is associated with greater use of third party sites, likely due to kids' deals

#### CONNECTIONSPEED_D
- Having a high speed connection (CONNECTIONSPEED_D = 1 versus 0) is associated with a 0.049 increase in DV
- P > 0.05 -> **NOT statistically significant**
- Connection speed does not appear to be related to the choice of site, as a household would likely experience similar connection times on home sites and third party sites


### Equation - minus all variables/constant with p > 0.05
log(p / (1 - p)) = 0.74REF_D + 0.12TRANS_FREQ - 0.019DURATION - 0.26CHILDREN_D, where p = P(DIRECTP_D = 1)
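
Because logit coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below is a rough check using the rounded coefficients from the per-variable notes (signs follow those notes, so CHILDREN_D is negative); the `coefs` dictionary is typed in by hand for illustration and is not read from the fitted model.

```python
import math

# Rounded log-odds coefficients for the significant IVs, as reported in the notes
coefs = {
    "REF_D": 0.74,
    "TRANS_FREQ": 0.12,
    "DURATION": -0.019,
    "CHILDREN_D": -0.26,
}

# exp(coefficient) = multiplicative change in the odds of DIRECTP_D = 1
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}, change in odds = {(ratio - 1) * 100:+.1f}%")
```

With the fitted results object from this notebook, the same quantities should be obtainable for all terms at once via `np.exp(logit_res.params)`.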

## Part 1 Bonus:

- From our regression results and our own personal experience, we believe that both HOUSEHOLD_SIZE and CONNECTIONSPEED_D can be removed from the set of Independent Variables. Both are not statistically significant according to our model and, from experience, do not affect the use of a home site or third party site for booking.
- Two new Independent Variables we believe may have some correlation with the DV are PROD_NAME and PROD_QTY. Our reasoning involves being able to purchase different levels of packages that may accommodate many people. For example, Groupon allows individuals to purchase tickets through a third party website when they have large groups, whereas a couple that only needs two tickets would be more likely to shop on the direct site.
- After looking into other regression models, we believe that a ridge regression model may be a useful alternative. It is very similar to an ordinary regression except that it adds a penalty on the squared size of the coefficients (an L2 penalty), which shrinks them toward zero and reduces the variance of the estimates at the cost of some bias. This may be most useful for reining in variables with large coefficients, like CHILDREN_D, or potentially the new variables we mentioned above.
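
To make the shrinkage idea concrete, here is a minimal sketch of ridge regression in its closed form on synthetic data. Everything in it is an assumption for illustration: the data are randomly generated, `ridge_fit` is a hypothetical helper, and the model is linear rather than logistic (for a binary DV the analogous tool would be an L2-penalized logistic regression).

```python
import numpy as np

# Synthetic data (not the hotel dataset): 3 predictors with known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.5, size=200)

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X'X + lam * I)^-1 X'y, without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# lam = 0 reproduces ordinary least squares; larger lam shrinks coefficients
penalties = (0.0, 10.0, 100.0, 1000.0)
norms = []
for lam in penalties:
    beta = ridge_fit(X, y, lam)
    norms.append(float(np.linalg.norm(beta)))
    print(lam, np.round(beta, 3), round(norms[-1], 3))
```

The overall size of the coefficient vector falls as the penalty grows, which is the variance-reducing shrinkage described in the last bullet.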

***
# Part 2 - Poisson
Use Count Data (Poisson) regression model to answer...

**“What are the factors that influence people’s booking frequencies?”**

- DV = TRANS_FREQ
- IV(s) = REF_D, LOG_PRICE, PAGES_VIEWED, HOUSEHOLD_SIZE, CHILDREN_D, CONNECTIONSPEED_D

Report and interpret regression results -> include the interpretation of the regression coefficients.

In [21]:
# step 1: create x and y; add constant

x = sm.add_constant(data[['REF_D', 'LOG_PRICE', 'PAGES_VIEWED',
                          'HOUSEHOLD_SIZE', 'CHILDREN_D','CONNECTIONSPEED_D']])
y = data.TRANS_FREQ

In [22]:
# step 2: build model

Poisson_mod = sm.Poisson(y,x)

In [23]:
# step 3: fit the model

Poisson_res = Poisson_mod.fit()

Optimization terminated successfully.
         Current function value: 2.808406
         Iterations 6


In [24]:
# step 4: inspect results using summary stats

print(f'{Poisson_res.summary()}')

                          Poisson Regression Results                          
Dep. Variable:             TRANS_FREQ   No. Observations:                 3749
Model:                        Poisson   Df Residuals:                     3742
Method:                           MLE   Df Model:                            6
Date:                Wed, 08 Apr 2020   Pseudo R-squ.:                 0.01729
Time:                        12:29:58   Log-Likelihood:                -10529.
converged:                       True   LL-Null:                       -10714.
Covariance Type:            nonrobust   LLR p-value:                 5.967e-77
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 0.4063      0.168      2.422      0.015       0.078       0.735
REF_D                -0.2274      0.019    -11.745      0.000      -0.265      -0.189
LOG_PRICE             0.

### Poisson Regression Observations (separated by IV)

*note: DV = log of the expected TRANS_FREQ*

#### REF_D
- A one unit increase is associated with a 0.23 decrease in DV
  - As a result, the **frequency of transactions will decrease by about 20.5%.**
- P < 0.05 -> statistically significant

#### LOG_PRICE
- A one unit increase is associated with a 0.0033 increase in DV
  - As a result, the **frequency of transactions will increase by 0.33%.**
- P > 0.05 -> **NOT statistically significant**

#### PAGES_VIEWED
- A one unit increase is associated with a 0.0023 increase in DV
  - As a result, the **frequency of transactions will increase by 0.23%.**
- P < 0.05 -> statistically significant

#### HOUSEHOLD_SIZE
- A one unit increase is associated with a 0.012 decrease in DV
  - As a result, the **frequency of transactions will decrease by about 1.2%.**
- P < 0.05 -> statistically significant

#### CHILDREN_D
- A one unit increase is associated with a 0.23 decrease in DV
  - As a result, the **frequency of transactions will decrease by about 20.5%.**
- P < 0.05 -> statistically significant

#### CONNECTIONSPEED_D
- A one unit increase is associated with a 0.90 increase in DV
  - As a result, the **frequency of transactions will increase by 146%.**
- P < 0.05 -> statistically significant  

### Equation - minus variables with p > 0.05

log(TRANS_FREQ) = 0.41 - 0.23REF_D + 0.0023PAGES_VIEWED - 0.012HOUSEHOLD_SIZE - 0.23CHILDREN_D + 0.90CONNECTIONSPEED_D

### Math behind (%) calculations

*exp = exponential function; thus, exp(x) -> e^x. Coefficients keep their sign, so a negative result is a percentage decrease.*

#### REF_D

(exp(-0.23) - 1) * 100% = **-20.5%**

#### LOG_PRICE

(exp(0.0033) - 1) * 100% = **0.33%**

#### PAGES_VIEWED

(exp(0.0023) - 1) * 100% = **0.23%**

#### HOUSEHOLD_SIZE

(exp(-0.012) - 1) * 100% = **-1.2%**

#### CHILDREN_D

(exp(-0.23) - 1) * 100% = **-20.5%**

#### CONNECTIONSPEED_D

(exp(0.90) - 1) * 100% = **146%**
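
The percentage calculations can be reproduced with a few lines of Python. This is a minimal sketch that uses the rounded coefficients from the notes above, entered by hand with their signs; `pct_change` is a hypothetical helper name, not part of statsmodels.

```python
import math

def pct_change(coef):
    """Percent change in the expected count for a one unit increase in an IV."""
    return (math.exp(coef) - 1.0) * 100.0

# Rounded Poisson coefficients (negative = decrease in the DV)
poisson_coefs = {
    "REF_D": -0.23,
    "LOG_PRICE": 0.0033,
    "PAGES_VIEWED": 0.0023,
    "HOUSEHOLD_SIZE": -0.012,
    "CHILDREN_D": -0.23,
    "CONNECTIONSPEED_D": 0.90,
}

for name, coef in poisson_coefs.items():
    print(f"{name}: {pct_change(coef):+.2f}%")
```

Keeping the sign inside `exp` matters: a coefficient of -0.23 corresponds to a decrease of roughly 20.5%, whereas exp(0.23) - 1 would describe an increase of roughly 26%.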

***
# Part 3 - Negative Binomial
Use Negative Binomial Regression model to answer...

**“What are the factors that influence people’s booking frequencies?”**

- DV = TRANS_FREQ
- IV(s) = REF_D, LOG_PRICE, PAGES_VIEWED, HOUSEHOLD_SIZE, CHILDREN_D, CONNECTIONSPEED_D

Report and interpret regression results -> include the interpretation of the regression coefficients

In [25]:
# step 1: create x and y; add constant

x = sm.add_constant(data[['REF_D', 'LOG_PRICE', 'PAGES_VIEWED',
                          'HOUSEHOLD_SIZE', 'CHILDREN_D','CONNECTIONSPEED_D']])
y = data.TRANS_FREQ

In [26]:
# step 2: build model

NB_mod = sm.GLM(y,x, family=sm.families.NegativeBinomial())

In [27]:
# step 3: fit the model

NB_res = NB_mod.fit()

In [28]:
# step 4: inspect results using summary stats

print(f'{NB_res.summary()}')

                 Generalized Linear Model Regression Results                  
Dep. Variable:             TRANS_FREQ   No. Observations:                 3749
Model:                            GLM   Df Residuals:                     3742
Model Family:        NegativeBinomial   Df Model:                            6
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -8365.2
Date:                Wed, 08 Apr 2020   Deviance:                       2280.8
Time:                        12:30:04   Pearson chi2:                 5.24e+03
No. Iterations:                    12                                         
Covariance Type:            nonrobust                                         
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 0.3971      0.25

### Negative Binomial Regression Observations (separated by IV)

*note: the NegativeBinomial family was used with its default dispersion parameter, alpha = 1*

*note: DV = log of the expected TRANS_FREQ*

#### REF_D
- A one unit increase is associated with a 0.22 decrease in DV
  - As a result, the **frequency of transactions will decrease by about 19.7%.**
- P < 0.05 -> statistically significant

#### LOG_PRICE
- A one unit increase is associated with a 0.0005 increase in DV
  - As a result, the **frequency of transactions will increase by 0.05%.**
- P > 0.05 -> **NOT statistically significant**

#### PAGES_VIEWED
- A one unit increase is associated with a 0.0029 increase in DV
  - As a result, the **frequency of transactions will increase by 0.29%.**
- P < 0.05 -> statistically significant

#### HOUSEHOLD_SIZE
- A one unit increase is associated with a 0.0085 decrease in DV
  - As a result, the **frequency of transactions will decrease by 0.85%.**
- P > 0.05 -> **NOT statistically significant**

#### CHILDREN_D
- A one unit increase is associated with a 0.24 decrease in DV
  - As a result, the **frequency of transactions will decrease by about 21.3%.**
- P < 0.05 -> statistically significant

#### CONNECTIONSPEED_D
- A one unit increase is associated with a 0.90 increase in DV
  - As a result, the **frequency of transactions will increase by 146%.**
- P < 0.05 -> statistically significant

### Equation - minus all variables/constant with p > 0.05

log(TRANS_FREQ) = - 0.22REF_D + 0.0029PAGES_VIEWED - 0.24CHILDREN_D + 0.90CONNECTIONSPEED_D

### Math behind (%) calculations

#### REF_D

[exp(-0.22) - 1] * 100% = **-19.7%**

#### LOG_PRICE

[exp(0.0005) - 1] * 100% = **0.05%**

#### PAGES_VIEWED

[exp(0.0029) - 1] * 100% = **0.29%**

#### HOUSEHOLD_SIZE

[exp(-0.0085) - 1] * 100% = **-0.85%**

#### CHILDREN_D

[exp(-0.24) - 1] * 100% = **-21.3%**

#### CONNECTIONSPEED_D

[exp(0.90) - 1] * 100% = **146%**

***
# Part 4 - Summary of Parts 2 and 3

Summarize observations by comparing the results from 2 and 3.

As the preferred model is contingent on the nature of the data, we need to check the nature of TRANS_FREQ by getting its summary statistics.

In [29]:
data['TRANS_FREQ'].describe()

count    3749.000000
mean        2.981328
std         4.120927
min         1.000000
25%         1.000000
50%         2.000000
75%         3.000000
max        30.000000
Name: TRANS_FREQ, dtype: float64

**Observations**

> mean = 2.98

> standard deviation = 4.12

> variance = (standard deviation)^2 = (4.12)^2 = 16.97

Poisson assumes mean = variance. NB allows for mean < variance.

**TRANS_FREQ has a smaller mean when compared to its variance** (2.98 and 16.97, respectively).

Thus, **NB regression is more appropriate for modeling the frequency of transactions**.
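
The overdispersion argument can be put into numbers. The sketch below uses the rounded summary statistics reported above and assumes the NB2 variance form, mean + alpha * mean^2; `nb2_variance` and `moment_alpha` are hypothetical names introduced for illustration. It compares marginal moments only, so it is a rough diagnostic rather than a formal test of the fitted models.

```python
# Rounded summary statistics of TRANS_FREQ from describe()
mean = 2.98
std = 4.12
variance = std ** 2

# Under a Poisson distribution this ratio would be close to 1
dispersion_ratio = variance / mean

def nb2_variance(mu, alpha):
    """Variance implied by a negative binomial (NB2 form) with mean mu."""
    return mu + alpha * mu ** 2

# Crude method-of-moments guess for alpha: solve variance = mean + alpha * mean^2
moment_alpha = (variance - mean) / mean ** 2

print(f"variance / mean = {dispersion_ratio:.2f}")
print(f"NB2 variance at alpha = 1: {nb2_variance(mean, 1.0):.2f}")
print(f"moment-based alpha: {moment_alpha:.2f}")
```

A variance-to-mean ratio well above 1 is what motivates the negative binomial model here; the moment-based alpha only hints at how much extra dispersion the data carry.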