## Graphing customer activity & engagement levels

Because we're a SaaS company that charges per seat for team subscriptions, user engagement is critical to maintaining our high retention levels. Charging per seat also means it is worth measuring the relationship between team size and engagement metrics such as views, posts, and other actions. Should we be focusing more time on smaller accounts, or only on the big fish?

### Table of Contents

1. [Understanding the Problem](#problem)
2. [Connecting to MongoDB](#connect)
3. [Extracting Data](#extracting)
4. [Exploratory Data Analysis](#eda)
5. Measuring Activity Levels
6. [Conclusion](#conclusion)

In [1]:
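# Create the conda environment so you can run this notebook
!conda env create -f environment.yml
# `conda activate` does not persist from a notebook cell, because each `!` command runs in its own subshell.
# Activate the environment in a terminal (conda activate dev) before launching Jupyter instead.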



In [2]:
# Importing Libraries
import numpy as np
# import matplotlib.pyplot as plt
from IPython.display import set_matplotlib_formats
import yaml
# from pymongo import MongoClient
import urllib.parse
import pandas as pd
import plotly.offline as py
import plotly.graph_objects as go
import cufflinks as cf
from cufflinks import tools
import plotly.io as pio
from plotly.subplots import make_subplots
# import psutil
import plotly.express as px

# Notebook Configurations
pd.set_option('display.max_columns', 500)
pd.options.display.float_format = '{:.2f}'.format


## 1. Understanding the Problem

To explore this, we pulled data on posts, comments, and views from our MongoDB database to answer two questions:

1. What is the cost of user inactivity?
2. Does activity increase with team size? If so, is the increase significant, and does it affect the types of prospects our marketing and sales teams should be targeting?

## 2. Connecting to MongoDB

<div class="alert alert-block alert-warning"><b>Set up your MongoDB connection before uncommenting the code below.</b></div>

We first need to import a *secret.yml* file with our MongoDB credentials - username, password, and server. The *secret.yml* takes the format:

```
username: "your_username"
password: "your_password"
server: "@your_server"
```

**Please remember to add the *secret.yml* file to your .gitignore!**

In [None]:
# # Import file secret.yml as cfg
# with open("secret.yml", 'r') as ymlfile:
#     cfg = yaml.safe_load(ymlfile)

We can use pymongo to connect to the MongoDB instance. The connection (URI) string takes the following format:

``` "mongodb://username:password@server" ```

Use *urllib* to escape special characters. For example, if your username is an email address containing '@', or your password contains any special characters, we recommend percent-encoding them with ```urllib.parse.quote()```, as follows:

In [None]:
# # Configuring
# username = cfg['username']
# password = cfg['password']
# server = cfg['server']

# # Connecting to MongoDB server
# conn = MongoClient("mongodb+srv://" + urllib.parse.quote(username) + ":" + urllib.parse.quote(password) + server)

__Note:__ If your connection string begins with "mongodb+srv:", make sure to install dnspython with: ```python -m pip install dnspython```
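
As a rough illustration of the escaping step, the sketch below builds a connection string from credentials that contain reserved characters. The helper name `build_mongo_uri`, the sample credentials, and the host are all made up for this example. It uses `urllib.parse.quote_plus` rather than `quote`, because `quote` leaves `/` unescaped by default. Unlike the *secret.yml* format above, the host here is passed without a leading `@`.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username: str, password: str, host: str,
                    scheme: str = "mongodb+srv") -> str:
    """Build a MongoDB URI with percent-encoded credentials (hypothetical helper)."""
    # quote_plus escapes '@', ':' and '/', all of which would otherwise
    # be read as URI delimiters.
    return f"{scheme}://{quote_plus(username)}:{quote_plus(password)}@{host}"


# Made-up credentials that contain reserved characters
uri = build_mongo_uri("kyle@example.com", "p@ss/word:1", "cluster0.example.mongodb.net")
print(uri)
```

After encoding, the only literal `@` left in the string is the one that separates the credentials from the host.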

## 3. Extracting Data

We have 3 connected collections in our database:

* users (and teams)
* posts
* comments

To extract data from MongoDB into pandas, we first have to select a database:

In [None]:
# # Select database
# db = conn.user_activity

Then, extract each collection to a DataFrame, collection by collection. Example:

``` users = pd.DataFrame(list(db.users.find())) ```

We need to do this for each one of the 3 collections.

In [None]:
# # Extract data from MongoDB and convert to DataFrames
# users = pd.DataFrame(list(db.users.find()))
# comments = pd.DataFrame(list(db.comments.find()))
# posts = pd.DataFrame(list(db.posts.find()))

# # It is good practice to close the connection to MongoDB after data extraction.
# conn.close()

**Remember to cache the data!**

Let's save to a cache file so we don't need to download the data every time we run the notebook. Some APIs can also have historical limits, so it's best to save/update the data every time it's pulled in.

In [None]:
# # Save to JSON
# users.to_json(r'cache/users_full.json')
# comments.to_json(r'cache/comments_full.json')
# posts.to_json(r'cache/posts_full.json')

# Open JSON datasets
users = pd.read_json('cache/users.json')
comments = pd.read_json('cache/comments.json')
posts = pd.read_json('cache/posts.json')
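
One way to make the caching step automatic is a small "load from cache, otherwise fetch and save" helper. This is only a sketch: `load_or_fetch` is a hypothetical name, and the `fetch` callable stands in for whatever query produces the DataFrame.

```python
from pathlib import Path

import pandas as pd


def load_or_fetch(cache_path, fetch):
    """Return the cached DataFrame if the file exists; otherwise call
    fetch(), write the result to cache_path as JSON, and return it."""
    path = Path(cache_path)
    if path.exists():
        return pd.read_json(path)
    df = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)  # create cache/ if needed
    df.to_json(path)
    return df


# Example usage (commented out because it needs a live MongoDB connection):
# users = load_or_fetch("cache/users.json",
#                       lambda: pd.DataFrame(list(db.users.find())))
```

With this pattern the database is only queried when the cache file is missing, so deleting the file is enough to force a refresh.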

## 4. Exploratory Data Analysis

First, let's prepare the data. I combined the users, posts, and comments datasets into a new dataset called ```teamactivity```, grouped by team, in order to analyze activity metrics per team. Take a look at the dataset format:

In [None]:
# Combine datasets and group by team
teamactivity = pd.merge(users, comments, how='left', on=['user_id'])
teamactivity = pd.merge(teamactivity, posts, how='left', on=['user_id'])
teamactivity = teamactivity[['team_id', 'user_id',
                             'posts_id_y', 'comment_id',
                             'views']].groupby(['team_id']).agg({
    'user_id': "nunique",  # team size
    'posts_id_y': "nunique",  # posts
    'comment_id': "nunique",  # comments
    'views': "sum"  # views
})
teamactivity = teamactivity.rename(columns={'user_id': 'users', 'posts_id_y': 'posts', 'comment_id': 'comments'})

teamactivity.sort_values('users', ascending=False).head(3)

There are 151 teams, ranging in size from 1 to 505 users. Teams of 1-19 users are the most common.
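
One thing worth checking in a merge-then-aggregate pipeline like the one above is row fan-out. Joining comments and posts onto users by `user_id` creates one row per (comment, post) pair for each user. The `nunique` counts are unaffected, but a summed column can be counted several times. The sketch below uses made-up toy tables and assumes `views` is stored once per post, which the actual schema may or may not match. It compares the merged sum with an alternative that aggregates each table per user before joining.

```python
import pandas as pd

# Toy data (invented for illustration): user 1 has two posts and two comments
users = pd.DataFrame({"user_id": [1, 2, 3], "team_id": ["a", "a", "b"]})
posts = pd.DataFrame({"posts_id": [10, 11, 12], "user_id": [1, 1, 3], "views": [5, 7, 2]})
comments = pd.DataFrame({"comment_id": [100, 101, 102], "user_id": [1, 1, 2]})

# Merge-then-sum: each of user 1's posts appears once per comment, so views repeat
merged = users.merge(comments, how="left", on="user_id").merge(posts, how="left", on="user_id")
merged_views = merged.groupby("team_id")["views"].sum()

# Alternative: aggregate each activity table per user first, then join
posts_per_user = posts.groupby("user_id", as_index=False).agg(
    posts=("posts_id", "nunique"), views=("views", "sum"))
comments_per_user = comments.groupby("user_id", as_index=False).agg(
    comments=("comment_id", "nunique"))

per_user = (
    users.merge(posts_per_user, how="left", on="user_id")
    .merge(comments_per_user, how="left", on="user_id")
    .fillna({"posts": 0, "views": 0, "comments": 0})
)
team_summary = per_user.groupby("team_id").agg(
    users=("user_id", "nunique"),
    posts=("posts", "sum"),
    views=("views", "sum"),
    comments=("comments", "sum"),
).astype(int)

print(merged_views)
print(team_summary)
```

If the two view totals differ on the real data, the per-user aggregation is likely the safer basis for the charts that follow.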

In [None]:
# Distribution of Team Size
fig = px.histogram(teamactivity, x="users", nbins=40,
                  labels={'count':'Teams','users':'Team size'},
                  title='Distribution of teams vs team size'
)
fig.update_layout(
    height=600,
    width=800,
    yaxis=dict(title='# of Teams')
)

fig.update_traces(marker=dict(line=dict(color='black', width=1.1), color='indianred'))

fig.show()

* In total, we are analyzing 150 teams of various sizes.
* Team sizes range from 1 to 486 users.
* The average is 141 users per team.
* Far more teams fall in the 1-40 user range.
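
Summary figures like these can be computed directly from the team-size column rather than read off the chart. The sketch below uses a made-up series of team sizes, not the real data. It also reports the median, since a few very large teams can pull the mean well above the typical team size.

```python
import pandas as pd

# Made-up team sizes for illustration; in the notebook this would be teamactivity["users"]
team_sizes = pd.Series([1, 3, 8, 15, 40, 120, 486], name="users")

summary = {
    "teams": int(team_sizes.count()),
    "min_size": int(team_sizes.min()),
    "max_size": int(team_sizes.max()),
    "mean_size": round(float(team_sizes.mean()), 1),
    "median_size": float(team_sizes.median()),
    # Share of teams with 1 to 40 users (between() is inclusive on both ends)
    "share_1_to_40": round(float(team_sizes.between(1, 40).mean()), 2),
}
print(summary)
```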

## 5. Measuring Activity Levels

Let's look at how team size affects the amount of knowledge sharing and engagement, by measuring:
* __Content Creation:__ the number of posts per team size
* __Views:__ the number of post views per team size
* __Content Performance:__ the number of views per post
* __Team Engagement:__ the number of comments users make per team and per team size

In [None]:
# Posts per Team Size
layout = cf.Layout(
    height = 600,
    width = 800,
    yaxis = dict(title = '# of Posts'),
    xaxis = dict(title = 'Team Size'),
    title = 'Posts by Team'
)

fig = teamactivity.groupby(['users'],as_index=True)['posts'].sum().\
    iplot(kind='scatter',mode='markers+lines',size=8,asFigure=True,layout = layout, color='indianred')
fig.update_traces(marker=dict(line=dict(color='black', width=0.8)))
fig.show()

* As team size increases, the number of posts per team also increases, and the growth looks faster than linear (roughly exponential).
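
Whether growth like this is really exponential, or follows a power law instead, can be checked numerically rather than by eye. The sketch below fits a straight line to log(y) against x (the exponential model) and to log(y) against log(x) (the power-law model), then compares the two R² values. The function name `growth_fits` and the synthetic data are assumptions for illustration only.

```python
import numpy as np


def growth_fits(team_size, total):
    """Compare an exponential fit (log y vs x) with a power-law fit (log y vs log x)."""
    x = np.asarray(team_size, dtype=float)
    y = np.asarray(total, dtype=float)
    keep = (x > 0) & (y > 0)  # logarithms need strictly positive values
    x, y = x[keep], y[keep]
    log_y = np.log(y)

    def line_fit(feature):
        # Least-squares line through (feature, log_y); returns slope and R^2
        slope, intercept = np.polyfit(feature, log_y, 1)
        residuals = log_y - (slope * feature + intercept)
        return slope, 1 - residuals.var() / log_y.var()

    exp_rate, exp_r2 = line_fit(x)
    power_exponent, power_r2 = line_fit(np.log(x))
    return {"exp_rate": exp_rate, "exp_r2": exp_r2,
            "power_exponent": power_exponent, "power_r2": power_r2}


# Synthetic data that follows a power law, posts = 2 * size^1.5
sizes = np.array([1, 5, 10, 50, 100, 500])
posts_synthetic = 2.0 * sizes ** 1.5
fits = growth_fits(sizes, posts_synthetic)
print(fits)
```

A power-law exponent near 1 would mean activity simply scales in proportion to headcount. An exponent clearly above 1 would suggest that larger teams are more active per seat.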

In [None]:
# Posts Views per Team Size
layout = cf.Layout(
    height = 600,
    width = 800,
    yaxis = dict(title = '# of Views'),
    xaxis = dict(title = '# of Users'),
    title = 'Posts Views by Team Size'
)

fig = teamactivity.groupby(['users'],as_index=True)['views'].sum().\
    iplot(kind='scatter', mode='markers+lines',size=8,asFigure=True,layout = layout, color='indianred')
fig.update_traces(marker=dict(line=dict(color='black', width=0.8)))

fig.show()

* As team size increases, the higher number of posts comes with what looks like exponential growth in the number of post views.
* Users can view posts inside their own groups, as well as any public posts on the platform, which are open to any reader.
* Views are not unique: if a user views the same post 100 times, it counts as 100 views (not as 1).

In [None]:
# Posts vs Views
fig = px.scatter(x=teamactivity.posts, y=teamactivity.views,
                 labels={'x':'# of Posts', 'y':'# of Views'},
                  title='Posts vs Pageviews')
fig.update_layout(
    height=600,
    width=800
)
fig.update_traces(marker=dict(line=dict(color='black', width=0.8), color='indianred'))

fig.show()

* As the number of posts increases, the overall number of views naturally increases.
* These views can represent either reach or impressions.
* __Reach__ is a metric that tells you how many people are seeing your content.
* __Impressions__ means the content was displayed but may not have generated any engagement or comments.
* The peak of 4202 posts registered almost 560 thousand views in total, an average of roughly 133 views per post (560,000 / 4,202).
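
The views-per-post metric can be computed per team with a guard for teams that have no posts. This is a sketch on invented numbers; the `activity` frame and its values are placeholders for the real ```teamactivity``` table.

```python
import numpy as np
import pandas as pd

# Invented numbers for three teams; t1 has no posts at all
activity = pd.DataFrame(
    {"posts": [0, 4, 20], "views": [0, 30, 400]},
    index=pd.Index(["t1", "t2", "t3"], name="team_id"),
)

# Replace zero post counts with NaN so those teams yield NaN rather than inf
activity["views_per_post"] = activity["views"] / activity["posts"].replace(0, np.nan)

# Overall ratio across all teams: total views divided by total posts
overall_views_per_post = activity["views"].sum() / activity["posts"].sum()

print(activity)
print(overall_views_per_post)
```

Note that the overall ratio is weighted by post count, so it can differ noticeably from the simple mean of the per-team ratios.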


### Team Engagement

In [None]:
# Comments per Team Size
layout = cf.Layout(
    height = 600,
    width = 800,
    yaxis = dict(title = '# of Comments'),
    xaxis = dict(title = 'Team Size'),
    title = 'Number of Comments by Team'
)
fig = teamactivity.groupby(['users'],as_index=True)['comments'].sum().\
    iplot(kind='scatter',mode='markers+lines',size=8,asFigure=True,layout = layout, color='indianred')
fig.update_traces(marker=dict(line=dict(color='black', width=0.8)))
fig.show()

* Some teams comment a lot more than others, which shows up as peaks within each range of team size.
* For example, the mid-sized teams with 387 and 394 users post more comments than the average for their size range.
* As team size increases, team engagement also appears to increase exponentially.
* Summing all comments across the periods analyzed, the team with more than 500 users averaged 60 comments per person.
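
Because total comments will rise with headcount almost by construction, a per-user rate is a more direct way to ask whether larger teams are more engaged. The sketch below uses invented numbers. It computes comments per user and a Spearman rank correlation with team size, calculated as the Pearson correlation of the ranks so that only pandas is needed.

```python
import pandas as pd

# Invented numbers for illustration; replace with the real teamactivity table
teams = pd.DataFrame({
    "users": [2, 10, 50, 200, 500],
    "comments": [4, 30, 100, 1000, 30000],
})

# Normalize by team size so that large teams are not favored by headcount alone
teams["comments_per_user"] = teams["comments"] / teams["users"]

# Spearman rank correlation = Pearson correlation of the ranked values
rank_corr = teams["users"].rank().corr(teams["comments_per_user"].rank())

print(teams)
print(rank_corr)
```

A rank correlation close to zero on the real data would suggest that engagement per seat does not depend much on team size, even though the totals grow.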

## 6. Conclusion

* As team size increases, so too do the total numbers of posts, comments, and views.
* Naturally, as the number of posts increases, so too does the number of post views.
* __To increase activity levels__, implement features that encourage engagement: commenting, following, tagging, questions, feedback, etc.