# Email from the boss...
*I have some questions for you that I need to be answered before the board meeting Thursday afternoon. I need to be able to speak to the following questions. I also need a single slide that I can incorporate into my existing presentation (Google Slides) that summarizes the most important points. My questions are listed below; however, if you discover anything else important that I didn’t think to ask, please include that as well.*

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from wrangle import full_wrangle
import warnings
warnings.filterwarnings('ignore')

# reloads imported files each time a cell is run
%load_ext autoreload
%autoreload 2



In [2]:
df, df_staff, df_multicohort, df_unimputed, df_non_curriculum, df_outliers = full_wrangle()

This returned the following dataframes (reassign if you missed any):


Dataframe,Description,Record Count,Percent of Raw df
df,Fully cleaned dataframe,509409,56.6%
df_staff,Cohort == Staff,84031,9.33%
df_multicohort,Users listed in more than one cohort,22708,2.52%
df_unimputed,Users with unknown/unimputable cohorts,45904,5.1%
df_non_curriculum,"Accessess not related to the curriculum, i.e. directories, images",116539,12.9%
df_outliers,Accesses meeting outlier conditions,121626,13.5%


# Question 1 - Which lesson appears to attract the most traffic consistently across cohorts (per program)?

First we split the analysis dataframe between the two programs:

In [3]:
df_web = df[df.program_type == 'Web Development']
df_ds = df[df.program_type == 'Data Science']

In [4]:
df_web.shape, df_ds.shape

((451039, 11), (58370, 11))

## Examining Web Dev:

In [5]:
df_web.lesson.value_counts().nlargest(10)

Not Lesson                                                                   97362
mysql.tables                                                                  5477
javascript-i.introduction.working-with-data-types-operators-and-variables     5312
mysql.databases                                                               5169
javascript-i.functions                                                        5163
javascript-i.javascript-with-html                                             5157
html-css.elements                                                             5046
mysql.users                                                                   4858
html-css.css-ii.bootstrap-grid-system                                         4843
java-iii.jsp-and-jstl                                                         4840
Name: lesson, dtype: int64

'Not Lesson' ranks first, but those entries represent higher-level accesses in the URL tree that feed down to specific lessons, so they are ignored.

I originally looked at the top 5 lessons overall to see how they tracked through time and find the one with the most traffic.

However, because URLs were renamed over time and some cohorts have very few accesses, this did not prove effective.

In [41]:
x = pd.crosstab(df_ds.cohort, df_ds.lesson).T

In [51]:
for n in x.columns:
    print(x[n].nlargest(10))

lesson
1-fundamentals.1.1-intro-to-data-science    573
6-regression.1-overview                     411
10-anomaly-detection.1-overview             289
6-regression.5.0-evaluate                   284
5-stats.3-probability-distributions         266
5-stats.4.2-compare-means                   263
6-regression.7.0-model                      257
Not Lesson                                  253
6-regression.4.0-explore                    232
4-python.7.4.3-dataframes                   227
Name: Bayes, dtype: int64
lesson
6-regression.1-overview                     432
1-fundamentals.1.1-intro-to-data-science    344
3-sql.1-mysql-overview                      333
10-anomaly-detection.1-overview             267
4-python.8.4.3-dataframes                   205
4-python.8.4.4-advanced-dataframes          191
4-python.3-data-types-and-variables         177
5-stats.4.2-compare-means                   157
5-stats.2-simulation                        141
3-sql.7-functions                           139


##### I determined the best approach is to use a loop to find the top three lessons for each cohort, then count the occurrences of the top lessons.

In [6]:
wd_cl = df_web.groupby('cohort').lesson.value_counts()

In [7]:
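# Walk the (cohort, lesson) index; value_counts() already sorts each cohort
# by descending count, so the first three real lessons are the top three.
top_three_wd = []
counter = 0
name = wd_cl.index[0][0]
result_row = {}
for c in wd_cl.index:
    # A new cohort starts: store the finished row and reset the state.
    if c[0] != name:
        top_three_wd.append(result_row)
        counter = 0
        result_row = {}
        name = c[0]
    # Skip the 'Not Lesson' placeholder so it never takes a top-three slot.
    if counter < 3 and c[1] != 'Not Lesson':
        result_row['Cohort'] = c[0]
        result_row[f'#{counter+1} lesson'] = c[1]
        result_row[f'#{counter+1} lesson count'] = wd_cl[c]
        counter += 1
# No cohort change follows the last cohort, so append its row here.
top_three_wd.append(result_row)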
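
The loop above can also be expressed with pandas group operations. The following is a minimal sketch, not the code used in this notebook: `top_n_lessons` is a hypothetical helper, the toy data is made up, and it assumes `cohort` and `lesson` are plain string columns.

```python
import pandas as pd

def top_n_lessons(df, n=3, skip="Not Lesson"):
    """Return the n most accessed lessons per cohort as a tidy frame."""
    counts = (
        df[df["lesson"] != skip]              # drop the placeholder rows
        .groupby(["cohort", "lesson"])
        .size()
        .reset_index(name="count")
        .sort_values(["cohort", "count"], ascending=[True, False])
    )
    top = counts.groupby("cohort").head(n).copy()
    top["rank"] = top.groupby("cohort").cumcount() + 1
    return top.reset_index(drop=True)

# Toy data standing in for df_web (values are invented for illustration).
toy = pd.DataFrame({
    "cohort": ["A"] * 6 + ["B"] * 4,
    "lesson": ["Not Lesson", "x", "x", "y", "z", "w",
               "Not Lesson", "y", "y", "x"],
})
top = top_n_lessons(toy)

# How often each lesson lands in any cohort's top n.
appearances = top["lesson"].value_counts()
```

Because the result is in long form (one row per cohort and rank), counting how often a lesson reaches the top three is a single `value_counts()` call rather than a sum of three Series.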

I also wrapped this logic in a function:

In [8]:
import explore_rm

In [9]:
wd_tl = explore_rm.lesson_top_three(df_web)
wd_tl

Top ten lessons:
----------
Not Lesson                                                                   97362
mysql.tables                                                                  5477
javascript-i.introduction.working-with-data-types-operators-and-variables     5312
mysql.databases                                                               5169
javascript-i.functions                                                        5163
javascript-i.javascript-with-html                                             5157
html-css.elements                                                             5046
mysql.users                                                                   4858
html-css.css-ii.bootstrap-grid-system                                         4843
java-iii.jsp-and-jstl                                                         4840
Name: lesson, dtype: int64


Unnamed: 0,Cohort,#1 lesson,#1 lesson count,#2 lesson,#2 lesson count,#3 lesson,#3 lesson count
0,Andromeda,javascript-i.introduction.working-with-data-types-operators-and-variables,263.0,mysql.tables,262.0,spring.fundamentals.repositories,259.0
1,Apex,java-i.syntax-types-and-variables,360.0,java-ii.object-oriented-programming,355.0,mysql.tables,348.0
2,Arches,javascript-i.loops,66.0,html-css.elements,65.0,javascript-ii.promises,63.0
3,Badlands,content.php_ii.command-line,6.0,content.php_i,5.0,content.php_ii.control-structures-i,5.0
4,Bash,javascript-i.introduction,260.0,javascript-i.javascript-with-html,213.0,javascript-i.introduction.working-with-data-types-operators-and-variables,211.0
5,Betelgeuse,html-css.elements,382.0,html-css.css-ii.bootstrap-grid-system,341.0,javascript-i.javascript-with-html,274.0
6,Ceres,javascript-i.introduction.working-with-data-types-operators-and-variables,372.0,html-css.forms,353.0,html-css.elements,351.0
7,Deimos,html-css.css-ii.bootstrap-introduction,340.0,html-css.css-ii.bootstrap-grid-system,324.0,mysql.tables,311.0
8,Europa,html-css.elements,340.0,html-css.css-i.selectors-and-properties,299.0,mysql.tables,285.0
9,Fortuna,mysql.tables,303.0,mysql.basic-statements,286.0,java-iii.jsp-and-jstl,275.0


In [10]:
a1 = wd_tl['#1 lesson'].value_counts()
b1 = wd_tl['#2 lesson'].value_counts()
c1 = wd_tl['#3 lesson'].value_counts() 
(a1 + b1 + c1).nlargest()

mysql.tables                                                                 11.0
html-css.elements                                                             7.0
javascript-i.introduction.working-with-data-types-operators-and-variables     7.0
javascript-i.javascript-with-html                                             4.0
dtype: float64

##### mysql.tables was the most accessed lesson overall, and it ranks in the top three most consistently across cohorts.

### *Looking at unit:*

I created a function that derives a unit column from the path and applied it to the web analysis dataframe.
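
The function itself lives in `explore_rm` and is not shown here. As a rough sketch of the idea only, assuming a hypothetical `path` column whose segments are separated by `/` (for example `mysql/tables`), the unit could be taken as the first path segment:

```python
import pandas as pd

def add_unit_column(df, path_col="path"):
    """Add a 'unit' column holding the first segment of each path.

    `path_col` and the '/' separator are assumptions for illustration.
    """
    out = df.copy()
    out["unit"] = out[path_col].str.strip("/").str.split("/").str[0]
    return out

# Toy paths (invented) showing a nested lesson and a top-level page.
toy = pd.DataFrame({"path": ["mysql/tables", "javascript-i/functions", "index.html"]})
units = add_unit_column(toy)["unit"].tolist()
```

A top-level page such as `index.html` has no parent directory, so under this assumption it becomes its own "unit", which would explain why `index.html` can show up in unit counts.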

In [11]:
wd_tu = explore_rm.unit_top_three(df_web)
wd_tu

Top ten units:
----------
javascript-i     83241
html-css         59975
mysql            59947
jquery           43066
java-iii         39998
spring           39982
java-ii          38903
java-i           29129
javascript-ii    27918
examples          9765
Name: unit, dtype: int64


Unnamed: 0,Cohort,#1 unit,#1 unit count,#2 unit,#2 unit count,#3 unit,#3 unit count
0,Andromeda,javascript-i,3554,html-css,2598,mysql,2521
1,Apex,javascript-i,4175,mysql,3741,html-css,3452
2,Arches,javascript-i,921,html-css,626,javascript-ii,445
3,Badlands,content,57,prework,6,javascript-ii,5
4,Bash,javascript-i,3108,html-css,2141,mysql,2054
5,Betelgeuse,javascript-i,3952,html-css,2564,jquery,2205
6,Ceres,javascript-i,5794,html-css,5115,mysql,3765
7,Deimos,javascript-i,4444,html-css,4346,mysql,3567
8,Europa,html-css,3029,javascript-i,2918,mysql,2913
9,Fortuna,javascript-i,3969,mysql,3335,html-css,3196


In [12]:
a2 = wd_tu['#1 unit'].value_counts()
b2 = wd_tu['#2 unit'].value_counts()
c2 = wd_tu['#3 unit'].value_counts() 
(a2 + b2 + c2).nlargest()

javascript-i    30.0
html-css        23.0
mysql           22.0
spring           6.0
java-iii         5.0
dtype: float64

##### From a unit standpoint, javascript-i is the most accessed unit across cohorts, with mysql third.
### Overall, I would say "mysql/tables" is the most accessed lesson.

## Examining DS:

In [13]:
ds_tl = explore_rm.lesson_top_three(df_ds)
ds_tl

Top ten lessons:
----------
classification.overview                     1310
1-fundamentals.1.1-intro-to-data-science    1270
classification.scale_features_or_not.svg    1138
sql.mysql-overview                          1008
fundamentals.intro-to-data-science           946
6-regression.1-overview                      848
10-anomaly-detection.1-overview              573
3-sql.1-mysql-overview                       543
anomaly-detection.overview                   523
stats.compare-means                          510
Name: lesson, dtype: int64


Unnamed: 0,Cohort,#1 lesson,#1 lesson count,#2 lesson,#2 lesson count,#3 lesson,#3 lesson count
0,Bayes,1-fundamentals.1.1-intro-to-data-science,573,6-regression.1-overview,411,10-anomaly-detection.1-overview,289
1,Curie,6-regression.1-overview,432,1-fundamentals.1.1-intro-to-data-science,344,3-sql.1-mysql-overview,333
2,Darden,classification.overview,858,classification.scale_features_or_not.svg,713,sql.mysql-overview,573
3,Easley,classification.scale_features_or_not.svg,283,classification.overview,265,fundamentals.intro-to-data-science,210


In [14]:
a3 = ds_tl['#1 lesson'].value_counts()
b3 = ds_tl['#2 lesson'].value_counts()
c3 = ds_tl['#3 lesson'].value_counts()
(a3 + b3 + c3).nlargest()

Series([], dtype: float64)

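This returned an empty Series: `+` aligns the three Series on lesson name and produces NaN wherever a lesson is missing from any one of them, and no lesson appears in all three columns, so `nlargest` has nothing to return. Looking at just columns one and two: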

In [15]:
(a3 + b3)

1-fundamentals.1.1-intro-to-data-science    2
6-regression.1-overview                     2
classification.overview                     2
classification.scale_features_or_not.svg    2
dtype: int64
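
The NaN problem with `a3 + b3 + c3` can be avoided without dropping a column. The sketch below uses made-up lesson names in a toy frame shaped like the top-three tables above; it shows two options, `Series.add` with `fill_value=0` and stacking the rank columns before counting.

```python
import pandas as pd

# Toy top-three table (lesson names are placeholders, not real data).
top = pd.DataFrame({
    "#1 lesson": ["a", "b", "c"],
    "#2 lesson": ["b", "a", "d"],
    "#3 lesson": ["e", "f", "a"],
})

a = top["#1 lesson"].value_counts()
b = top["#2 lesson"].value_counts()
c = top["#3 lesson"].value_counts()

# Option 1: treat a lesson missing from one Series as 0 instead of NaN.
combined = a.add(b, fill_value=0).add(c, fill_value=0).astype(int)

# Option 2: stack the three rank columns into one Series, then count once.
stacked = top[["#1 lesson", "#2 lesson", "#3 lesson"]].stack().value_counts()
```

Both options count every top-three appearance, including lessons that appear in only one of the rank columns.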

##### Classification Overview is the highest overall and also scores high across cohorts, with introduction/fundamentals close behind. Interestingly, an image (the svg, which is not a lesson) ranked very highly, which suggests it is a very useful image. Regression overview and intro also rank highly.
Interesting note: the path syntax seems to have changed between Bayes and Easley, as evidenced by fundamentals having two separate high-scoring lessons.

### *Looking at unit:*

In [16]:
ds_tu = explore_rm.unit_top_three(df_ds)
ds_tu

Top ten units:
----------
classification      5993
sql                 5239
3-sql               4819
python              4078
4-python            3793
6-regression        3523
fundamentals        3462
1-fundamentals      3079
5-stats             2732
7-classification    2476
Name: unit, dtype: int64


Unnamed: 0,Cohort,#1 unit,#1 unit count,#2 unit,#2 unit count,#3 unit,#3 unit count
0,Bayes,6-regression,2121,4-python,2031,3-sql,1927
1,Curie,3-sql,1897,4-python,1682,6-regression,1381
2,Darden,classification,3991,python,2051,sql,2046
3,Easley,classification,1348,sql,1169,fundamentals,915


In [17]:
a4 = ds_tu['#1 unit'].value_counts()
b4 = ds_tu['#2 unit'].value_counts()
c4 = ds_tu['#3 unit'].value_counts()
(a4 + b4 + c4).nlargest()

Series([], dtype: float64)

This also did not work, since missing values wipe out the sums. Overall, though, the two sql units (`sql` and `3-sql`), when combined, appear to have the highest number of accesses.
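
One way to check the combined figure is to strip the numeric prefix from the older unit names and sum the counts. This is a sketch under the assumption that `3-sql` and `sql` refer to the same unit under the old and new path syntax; the numbers are copied from the "Top ten units" output above, so units outside the top ten are not included.

```python
import pandas as pd

# Counts copied from the "Top ten units" output for Data Science.
unit_counts = pd.Series({
    "classification": 5993, "sql": 5239, "3-sql": 4819, "python": 4078,
    "4-python": 3793, "6-regression": 3523, "fundamentals": 3462,
    "1-fundamentals": 3079, "5-stats": 2732, "7-classification": 2476,
})

# Drop a leading "<number>-" so both naming schemes share one label.
normalized = unit_counts.index.str.replace(r"^\d+-", "", regex=True)
combined = unit_counts.groupby(normalized).sum().sort_values(ascending=False)
```

On these ten rows, sql totals 10058 and classification 8469, which agrees with the observation that the two sql units together have the most accesses.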

##### While the sql unit has the most accesses overall, it also contains many lessons.
### Overall, I would say the "classification/overview" lesson is the most accessed.

# Question 7 - Which lessons are least accessed?

There are lots of lessons with just a few accesses (many with just 1).  Therefore, we focused on unit for this question.

### Web:

In [29]:
# Note: despite the name, this index holds the cohort names, not the units.
web_units = df_web.groupby('cohort').unit.nunique().index
web_units

Index(['Andromeda', 'Apex', 'Arches', 'Badlands', 'Bash', 'Betelgeuse',
       'Ceres', 'Deimos', 'Europa', 'Fortuna', 'Franklin', 'Ganymede',
       'Glacier', 'Hampton', 'Hyperion', 'Ike', 'Joshua', 'Jupiter', 'Kalypso',
       'Kings', 'Lassen', 'Luna', 'Mammoth', 'Marco', 'Neptune', 'Niagara',
       'Oberon', 'Olympic', 'Pinnacles', 'Quincy', 'Sequoia', 'Teddy',
       'Ulysses', 'Voyageurs', 'Wrangell', 'Xanadu', 'Yosemite', 'Zion'],
      dtype='object', name='cohort')

In [30]:
setter = df_web[df_web.cohort == "Andromeda"].lesson.unique()
for i in web_units:
    unit_list = df_web[df_web.cohort == i].lesson.unique()
    setter = list(set(setter).intersection(unit_list)) 
setter

['Not Lesson']
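
The loop above keeps only the lessons that every cohort accessed. The same idea can be sketched more compactly by building one set per cohort and intersecting them. `common_values` is a hypothetical helper and the toy data is invented.

```python
import pandas as pd

def common_values(df, col, group_col="cohort"):
    """Return the values of `col` that appear in every group of `group_col`."""
    sets = df.groupby(group_col)[col].apply(set)  # one set per cohort
    return set.intersection(*sets)

# Toy data: only 'Not Lesson' is seen by all three cohorts.
toy = pd.DataFrame({
    "cohort": ["A", "A", "A", "B", "B", "C", "C"],
    "lesson": ["x", "y", "Not Lesson", "x", "Not Lesson", "Not Lesson", "z"],
})
shared = common_values(toy, "lesson")
```

Passing `"unit"` instead of `"lesson"` would give the units shared by all cohorts in the same way.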

In [33]:
df_web.unit.value_counts()[df_web.unit.value_counts()>50].nsmallest(10)

introduction        52
git                 52
elements            54
1-fundamentals      62
prework            695
index.html        1080
web-design        1407
capstone          1847
slides            5993
content           6576
Name: unit, dtype: int64

##### Not counting prework (or the index), I would say the least accessed unit in general is 'web-design'.

### Data Science

In [22]:
ds_units = df_ds.groupby('cohort').unit.nunique().index

In [24]:
setter = df_ds[df_ds.cohort == "Bayes"].unit.unique()
for i in ds_units:
    unit_list = df_ds[df_ds.cohort == i].unit.unique()
    setter = list(set(setter).intersection(unit_list)) 
setter

['sql',
 'regression',
 'classification',
 'capstones',
 'anomaly-detection',
 'fundamentals',
 'python',
 'storytelling',
 'distributed-ml',
 'nlp',
 'timeseries',
 'clustering',
 '1-fundamentals',
 'stats']

In [25]:
df_ds.unit.value_counts()[df_ds.unit.value_counts()>50]

classification          5993
sql                     5239
3-sql                   4819
python                  4078
4-python                3793
6-regression            3523
fundamentals            3462
1-fundamentals          3079
5-stats                 2732
7-classification        2476
stats                   2423
regression              2093
11-nlp                  1536
8-clustering            1463
9-timeseries            1401
clustering              1325
2-storytelling          1188
timeseries              1113
10-anomaly-detection    1099
storytelling             998
nlp                      973
anomaly-detection        892
12-distributed-ml        818
13-advanced-topics       798
distributed-ml           425
advanced-topics          117
Name: unit, dtype: int64

##### Among the units common to all cohorts with more than 50 accesses, distributed-ml seems to be the least accessed.

In [26]:
ds_cohorts = df_ds.groupby('cohort').lesson.nunique().index

In [35]:
# Keep only the lessons that every Data Science cohort accessed.
setter = df_ds[df_ds.cohort == "Bayes"].lesson.unique()
for i in ds_cohorts:
    lesson_list = df_ds[df_ds.cohort == i].lesson.unique()
    setter = list(set(setter).intersection(lesson_list)) 
setter

['python.data-types-and-variables',
 'python.imports',
 '1-fundamentals.1.1-intro-to-data-science',
 'classification.decision-trees',
 'stats.probability-distributions',
 'classification.explore',
 'fundamentals.visualization-with-excel',
 'sql.more-exercises',
 'python.dataframes',
 'classification.prep',
 'classification.scale_features_or_not.svg',
 'anomaly-detection.overview',
 'fundamentals.data-science-pipeline',
 'storytelling.understand',
 'fundamentals.functions',
 'classification.acquire',
 'sql.tables',
 'stats.simulation',
 'fundamentals.vocabulary',
 'python.advanced-dataframes',
 'fundamentals.cli.more-topics',
 'fundamentals.git',
 'storytelling.create',
 'python.pandas-overview',
 'sql.limit',
 'capstones.capstones',
 'fundamentals.spreadsheets-overview',
 'regression.project',
 'classification.evaluation',
 'sql.indexes',
 'fundamentals.environment-setup',
 'python.intro-to-matplotlib',
 'timeseries.overview',
 'fundamentals.cli.listing-files',
 'regression.overview',


In [36]:
qw = []
for n in setter:
    row = {}
    row['Lesson'] = n
    row['Count'] = df[df.lesson == n].accessed.count()
    qw.append(row)
pd.DataFrame(qw).set_index('Lesson').Count.nsmallest(10)

Lesson
capstones.capstones                                46
storytelling.creating-charts                       56
fundamentals.cli.more-topics                       60
fundamentals.cli.moving-files                      70
storytelling.connecting-to-data                    78
nlp.overview                                       79
fundamentals.cli.navigating-the-filesystem         82
fundamentals.cli.file-paths                        83
fundamentals.cli.creating-files-and-directories    93
storytelling.tableau                               96
Name: Count, dtype: int64

Among the lessons accessed by every cohort, capstones.capstones has the fewest accesses (46).
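
The per-lesson counting loop can also be written as a filter followed by `value_counts`. This is a sketch with a hypothetical helper name and invented toy data, not the notebook's own code.

```python
import pandas as pd

def least_accessed(df, lessons, n=10):
    """Return the n lessons with the fewest accesses among `lessons`."""
    subset = df[df["lesson"].isin(lessons)]
    return subset["lesson"].value_counts().nsmallest(n)

# Toy access log: 'd' is excluded because it is not in the shared set.
toy = pd.DataFrame({"lesson": ["a"] * 3 + ["b"] + ["c"] * 2 + ["d"] * 5})
result = least_accessed(toy, {"a", "b", "c"}, n=2)
```

Note that a lesson with zero accesses never appears in `value_counts`, so this only ranks lessons that were accessed at least once.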

----