In [1]:
import warnings  
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import metrics
from env import get_db_url
import acquire


I have some questions that I need answered before the board meeting Thursday afternoon, so that I can speak to each of them. I also need a single slide (Google Slides) that summarizes the most important points and that I can incorporate into my existing presentation. My questions are listed below; however, if you discover anything else important that I didn’t think to ask, please include that as well.


1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?
2. Is there a cohort that referred to a lesson significantly more than the other cohorts, which seemed to gloss over it?
3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?
4. Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?
5. At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?
6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
7. Which lessons are least accessed?
8. Anything else I should be aware of?


In [2]:
df = acquire.get_curriculum_data()

In [3]:
df.head()

Unnamed: 0,date,time,path,user_id,ip,name,start_date,end_date,program_id
0,2018-01-26,09:55:03,/,1,97.105.19.61,Hampton,2015-09-22,2016-02-06,1.0
1,2018-01-26,09:56:02,java-ii,1,97.105.19.61,Hampton,2015-09-22,2016-02-06,1.0
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,97.105.19.61,Hampton,2015-09-22,2016-02-06,1.0
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,97.105.19.61,Hampton,2015-09-22,2016-02-06,1.0
4,2018-01-26,09:56:24,javascript-i/conditionals,2,97.105.19.61,Teddy,2018-01-08,2018-05-17,2.0


In [4]:
df.shape

(900223, 9)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 900223 entries, 0 to 900222
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   date        900223 non-null  object 
 1   time        900223 non-null  object 
 2   path        900222 non-null  object 
 3   user_id     900223 non-null  int64  
 4   ip          900223 non-null  object 
 5   name        847330 non-null  object 
 6   start_date  847330 non-null  object 
 7   end_date    847330 non-null  object 
 8   program_id  847330 non-null  float64
dtypes: float64(1), int64(1), object(7)
memory usage: 68.7+ MB


In [6]:
# parse the date strings and keep only the calendar date
# (.dt.date yields Python date objects, so the columns stay object dtype)
df['date'] = pd.to_datetime(df['date']).dt.date
df['start_date'] = pd.to_datetime(df['start_date']).dt.date
df['end_date'] = pd.to_datetime(df['end_date']).dt.date
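
Because `.dt.date` leaves the columns as `object` dtype, time-based operations such as resampling are not directly available. If those are needed later, one option is to build a single timestamp column from `date` and `time`. This is a minimal sketch, not something the notebook does; the helper `add_timestamp` and the `timestamp` column name are hypothetical.

```python
import pandas as pd


def add_timestamp(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a datetime64 'timestamp' column built from 'date' and 'time'."""
    out = frame.copy()
    # astype(str) handles both raw strings and datetime.date objects
    out["timestamp"] = pd.to_datetime(
        out["date"].astype(str) + " " + out["time"].astype(str)
    )
    return out


# Two rows in the same shape as the head() output of the access log
sample = pd.DataFrame({
    "date": ["2018-01-26", "2018-01-26"],
    "time": ["09:55:03", "09:56:02"],
    "path": ["/", "java-ii"],
})
sample_ts = add_timestamp(sample)
```

With a real timestamp column, `sample_ts.set_index("timestamp")` would allow `resample` and time-window slicing.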

In [7]:
# map each numeric program_id to its program name (1, 2 and 4 are all Web Development)
program_names = {1.0: 'Web Development', 2.0: 'Web Development',
                 3.0: 'Data Science', 4.0: 'Web Development'}
df['program_id'] = df['program_id'].map(program_names)

In [8]:
df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
date,900223.0,1182.0,2021-03-19,3104.0,,,,,,,
time,900223.0,73167.0,09:03:00,81.0,,,,,,,
path,900222.0,2313.0,/,50313.0,,,,,,,
user_id,900223.0,,,,458.825707,249.296767,1.0,269.0,475.0,660.0,981.0
ip,900223.0,5531.0,97.105.19.58,284579.0,,,,,,,
name,847330.0,47.0,Staff,84031.0,,,,,,,
start_date,847330.0,44.0,2014-02-04,92921.0,,,,,,,
end_date,847330.0,45.0,2014-02-04,84031.0,,,,,,,
program_id,847330.0,2.0,Web Development,743918.0,,,,,,,


In [9]:
for col in df.columns:
    print(col)
    print(df[col].value_counts())
    print("-------------------------------")

date
2021-03-19    3104
2021-04-12    2446
2021-03-25    2369
2020-09-08    2304
2021-03-16    2298
              ... 
2018-12-29      32
2018-12-22      30
2018-12-30      21
2019-07-04      16
2018-12-23      10
Name: date, Length: 1182, dtype: int64
-------------------------------
time
09:03:00    81
09:01:59    79
09:02:45    75
09:02:16    75
09:05:45    74
            ..
04:58:23     1
04:58:29     1
04:58:30     1
04:58:31     1
07:28:59     1
Name: time, Length: 73167, dtype: int64
-------------------------------
path
/                                                               50313
search/search_index.json                                        19519
javascript-i                                                    18983
toc                                                             18297
java-iii                                                        13733
                                                                ...  
javascript/loops                                

In [10]:
df.isna().sum()

date              0
time              0
path              1
user_id           0
ip                0
name          52893
start_date    52893
end_date      52893
program_id    52893
dtype: int64

**Notes:** 
- What does `path` refer to, and why does it have only 1 null?
- Are the 52,893 rows with no name, start_date, end_date, or program_id simply missing data, or do they represent something else (for example, users who were never assigned to a cohort)?
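
One way to probe the second note is to check that the four cohort columns are always missing together, then summarize the cohort-less rows per user. The sketch below assumes the column names shown in `df.info()`; the helper `summarize_missing_cohort` and the demo rows are made up for illustration.

```python
import numpy as np
import pandas as pd

COHORT_COLS = ["name", "start_date", "end_date", "program_id"]


def summarize_missing_cohort(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-user summary of rows that carry no cohort information."""
    missing_any = frame[COHORT_COLS].isna().any(axis=1)
    missing_all = frame[COHORT_COLS].isna().all(axis=1)
    # If these masks differ, some rows are only partially missing
    if not missing_any.equals(missing_all):
        raise ValueError("cohort columns are not always missing together")
    no_cohort = frame[missing_all]
    return (
        no_cohort.groupby("user_id")
        .agg(
            hits=("path", "size"),        # number of requests
            first_seen=("date", "min"),   # earliest access
            last_seen=("date", "max"),    # latest access
            n_ips=("ip", "nunique"),      # distinct IP addresses used
        )
        .sort_values("hits", ascending=False)
    )


# Made-up rows: user 1 has a cohort, users 2 and 3 do not
demo = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03"],
    "path": ["/", "toc", "java-i", "toc"],
    "user_id": [1, 2, 2, 3],
    "ip": ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"],
    "name": ["Hampton", np.nan, np.nan, np.nan],
    "start_date": ["2015-09-22", np.nan, np.nan, np.nan],
    "end_date": ["2016-02-06", np.nan, np.nan, np.nan],
    "program_id": [1.0, np.nan, np.nan, np.nan],
})
summary = summarize_missing_cohort(demo)
```

On the real data, a user whose cohort-less activity spans a long window or many IP addresses would be worth a closer look.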


In [11]:
df[df.name.isna()].user_id.value_counts()

354    2965
736    2358
363    2248
716    2136
368    2085
       ... 
644       6
663       4
62        4
89        3
176       3
Name: user_id, Length: 78, dtype: int64

In [12]:
df[df.name.isna()].ip.value_counts()

97.105.19.58       15931
70.117.16.60        1903
67.11.117.74        1729
70.94.165.107       1715
23.116.170.48       1616
                   ...  
72.181.117.212         1
173.149.221.121        1
96.8.179.87            1
172.124.66.235         1
172.58.111.220         1
Name: ip, Length: 413, dtype: int64

1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?
    - Web Development = javascript-i
    - Data Science = search/search_index.json

In [21]:
web_dev = df[(df.program_id=='Web Development') & (df.path!= '/')]
ds = df[(df.program_id=='Data Science') & (df.path!= '/')]

In [22]:
web_dev.path.value_counts()

javascript-i                                                    18193
toc                                                             17580
search/search_index.json                                        15331
java-iii                                                        13162
html-css                                                        13111
                                                                ...  
4-stats/2.7-correlation                                             1
4-python/error-handling                                             1
10-anomaly-detection/isolation-forests                              1
10-anomaly-detection/time-series-anomaly-detection-part-3           1
appendix/professional-development/post-interview-review-form        1
Name: path, Length: 2053, dtype: int64

In [23]:
ds.path.value_counts()

search/search_index.json                    2203
classification/overview                     1785
1-fundamentals/modern-data-scientist.jpg    1655
1-fundamentals/AI-ML-DL-timeline.jpg        1651
1-fundamentals/1.1-intro-to-data-science    1633
                                            ... 
python/custom-sorting-functions                1
imports                                        1
java-i/console-io                              1
appendix/univariate_regression_in_excel        1
6-regression/8-Project                         1
Name: path, Length: 681, dtype: int64

In [51]:
pd.set_option("display.max_rows", 2500)
ds.path.groupby(ds.name).value_counts()

name      path
Bayes     1-fundamentals/modern-data-scientist.jpg    650
          1-fundamentals/AI-ML-DL-timeline.jpg        648
          1-fundament

In [35]:
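def value_counts_and_frequencies(s: pd.Series, dropna=False) -> pd.DataFrame:
    """Return the count and relative frequency ('proba') of each value in s."""
    # dropna is now passed through; it was previously accepted but ignored
    # (both calls hard-coded dropna=False, which remains the default here)
    return pd.merge(
        s.value_counts(dropna=dropna).rename('count'),
        s.value_counts(dropna=dropna, normalize=True).rename('proba'),
        left_index=True,
        right_index=True,
    )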


In [38]:
path_ds = value_counts_and_frequencies(ds.path)
path_ds.head()

Unnamed: 0,count,proba
search/search_index.json,2203,0.023176
classification/overview,1785,0.018779
1-fundamentals/modern-data-scientist.jpg,1655,0.017411
1-fundamentals/AI-ML-DL-timeline.jpg,1651,0.017369
1-fundamentals/1.1-intro-to-data-science,1633,0.01718


In [39]:
path_web_dev = value_counts_and_frequencies(web_dev.path)
path_web_dev.head()

Unnamed: 0,count,proba
javascript-i,18193,0.025754
toc,17580,0.024886
search/search_index.json,15331,0.021702
java-iii,13162,0.018632
html-css,13111,0.01856
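
The top paths above include `search/search_index.json`, `toc`, and image files, which are not lessons, and the overall counts do not show whether a lesson is popular in every cohort or only in a few. A possible refinement, sketched below, drops non-lesson paths and then counts how many cohorts have each path among their `n` most-visited lessons. The exclusion rules and the helper names `lesson_rows` and `top_lessons_by_cohort` are assumptions, and the demo cohorts "A" and "B" are made up.

```python
import pandas as pd

# Assumed non-lesson paths; adjust after inspecting the real data
NON_LESSON_EXACT = {"/", "toc", "search/search_index.json"}
NON_LESSON_PATTERN = r"\.(?:json|jpg|jpeg|png|gif|svg|js|css|map)$"


def lesson_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose path looks like a lesson page."""
    path = frame["path"].fillna("")
    is_asset = path.str.contains(NON_LESSON_PATTERN, case=False, regex=True)
    keep = (path != "") & ~path.isin(NON_LESSON_EXACT) & ~is_asset
    return frame[keep]


def top_lessons_by_cohort(frame: pd.DataFrame, n: int = 3) -> pd.Series:
    """For each path, count the cohorts that have it among their n most-visited lessons."""
    lessons = lesson_rows(frame)
    counts = lessons.groupby(["name", "path"]).size().rename("hits").reset_index()
    # Rank paths within each cohort, most hits first
    counts["rank"] = counts.groupby("name")["hits"].rank(method="first", ascending=False)
    top = counts[counts["rank"] <= n]
    return (
        top.groupby("path")["name"].nunique()
        .sort_values(ascending=False)
        .rename("cohorts_in_top_n")
    )


# Made-up access log for two hypothetical cohorts
demo = pd.DataFrame({
    "name": ["A"] * 4 + ["B"] * 4,
    "path": ["javascript-i", "javascript-i", "toc", "java-i",
             "javascript-i", "html-css", "html-css", "search/search_index.json"],
})
result = top_lessons_by_cohort(demo, n=2)
```

Applied per program (for example to `web_dev` and `ds`), a path that appears in the top `n` for nearly every cohort is a stronger answer to question 1 than a path with a high total driven by one large cohort.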


2. Is there a cohort that referred to a lesson significantly more than the other cohorts, which seemed to gloss over it?

In [46]:
# NOTE: .sum() on a string column concatenates the paths; it does not count them
ds.path.groupby(ds.name).sum()

name
Bayes       3-sql/1-mysql-overview2-storytelling/bad-chart...
Curie       loginloginlogin4-python/1-overview1-fundamenta...
Darden      13-advanced-topics/1-tidy-data1-fundamentals/1...
Easley      python/data-types-and-variablesfundamentals/in...
Florence    fundamentals/intro-to-data-sciencefundamentals...
Name: path, dtype: object
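
Since summing strings only concatenates them, a count-based comparison is needed to answer question 2. One possible approach, sketched here, computes each path's share of a cohort's total hits and flags cohort/path pairs whose share is well above the median share across all cohorts. The helpers `cohort_path_share` and `standout_lessons`, the `ratio` and `min_share` thresholds, and the demo cohorts are all assumptions for illustration.

```python
import pandas as pd


def cohort_path_share(frame: pd.DataFrame) -> pd.DataFrame:
    """Rows are cohorts, columns are paths, values are the path's share of the cohort's hits."""
    return pd.crosstab(frame["name"], frame["path"], normalize="index")


def standout_lessons(frame: pd.DataFrame, ratio: float = 3.0,
                     min_share: float = 0.01) -> pd.DataFrame:
    """Flag (cohort, path) pairs whose share exceeds ratio times the median share."""
    share = cohort_path_share(frame)
    median_share = share.median(axis=0)  # median over cohorts, per path
    # Long format: one row per (cohort, path)
    long = share.reset_index().melt(id_vars="name", var_name="path", value_name="share")
    long["median_share"] = long["path"].map(median_share)
    flagged = long[
        (long["share"] >= min_share)
        & (long["share"] > ratio * long["median_share"])
    ]
    return flagged.sort_values("share", ascending=False).reset_index(drop=True)


# Made-up log: only hypothetical cohort "A" visits "sql"
demo = pd.DataFrame({
    "name": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "path": ["sql", "sql", "sql", "python",
             "python", "python", "stats", "stats",
             "python", "python", "stats", "stats"],
})
flagged = standout_lessons(demo)
```

Using shares rather than raw counts keeps large cohorts from dominating, and the median makes the baseline robust to a single outlying cohort.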

In [47]:
# NOTE: .sum() on a string column concatenates the paths; it does not count them
web_dev.path.groupby(web_dev.name).sum()

name
Andromeda     assets/js/pdfmake.min.js.maptoctochtml-csshtml...
Apex          toctoctoctochtml-csshtml-css/introductionjava-...
Apollo        content/html-csscontent/html-css/gitbook/image...
Arches        javascript-ijavascript-i/functionsappendix/fur...
Badlands      prework/fundamentalsprework/fundamentalsprewor...
Bash          html-csshtml-csshtml-csstoctochtml-csshtml-css...
Betelgeuse    html-csstoctoctoctocspringtocappendixhtml-csst...
Ceres         tocquizjavascript-ispringjava-ijavascript-ijav...
Deimos        html-cssjavascript-ihtml-csshtml-csshtml-cssht...
Denali        mkdocs/search_index.jsonprework/databasesprewo...
Europa        toctoctoctoctoctoctoctochtml-csstoctoctocmysql...
Fortuna       tochtml-css/introductionhtml-css/elementstocto...
Franklin      java-iiijava-iii/user-inputjavascript-ijavascr...
Ganymede      java-itoctoctoctochtml-csshtml-css/introductio...
Glacier       prework/fundamentalsprework/databasesprework/d...
Hampton       java-iijava-ii/object