# Data Analysis Interview Challenge

## Part 1 ‐ Exploratory data analysis

The attached logins.json file contains (simulated) timestamps of user logins in a particular geographic location. Aggregate these login counts based on 15-minute time intervals, and visualize and describe the resulting time series of login counts in ways that best characterize the underlying patterns of the demand. Please report/illustrate important features of the demand, such as daily cycles. If there are data quality issues, please report them.


In [26]:
import json
import requests
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import plotly.graph_objects as go

logins = pd.read_json('/Users/pedrorodriguez/Desktop/Springboard/Chapters/25-  Data Science Interview Process/ultimate_challenge/logins.json')

logins.info()

logins.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93142 entries, 0 to 93141
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   login_time  93142 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 727.8 KB


Unnamed: 0,login_time
0,1970-01-01 20:13:18
1,1970-01-01 20:16:10
2,1970-01-01 20:16:37


In [2]:
intervals_15 = logins.resample('15Min', on= 'login_time').count()
intervals_15 = intervals_15.rename(columns ={'login_time': '15_min_interval'})
intervals_15 = intervals_15.reset_index()
intervals_15.head(3)

Unnamed: 0,login_time,15_min_interval
0,1970-01-01 20:00:00,2
1,1970-01-01 20:15:00,6
2,1970-01-01 20:30:00,9


In [3]:
fig= px.line(intervals_15, x= 'login_time', y= '15_min_interval')
fig.show()

In the 15-minute series, the highest count is 73 logins, registered in the interval starting at 4:30 on March 1.
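A peak like this can be read off the plot, but it can also be looked up programmatically. The sketch below uses a small synthetic frame (`demo`, with made-up counts) shaped like `intervals_15`; it is an illustration under those assumptions, not the notebook's actual lookup.

```python
import pandas as pd

# Synthetic stand-in for intervals_15 (values are illustrative, not from logins.json).
demo = pd.DataFrame({
    'login_time': pd.date_range('1970-01-01 20:00', periods=4, freq='15min'),
    '15_min_interval': [2, 6, 9, 7],
})

# idxmax returns the index label of the largest count; .loc retrieves that row.
peak = demo.loc[demo['15_min_interval'].idxmax()]
print(peak['login_time'], peak['15_min_interval'])
```

The same two lines applied to `intervals_15` should return the interval start time and count of the busiest 15-minute window.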

To better understand the logins registered, I will group the logins by day to identify any patterns.

In [4]:
intervals_15['month'] = intervals_15['login_time'].dt.month.astype(object)
intervals_15['day'] = intervals_15['login_time'].dt.day.astype(object)


In [32]:
int_15_daily = intervals_15.resample('D', on= 'login_time').sum()
int_15_daily = int_15_daily.rename(columns ={'15_min_interval': 'daily_login'})
int_15_daily = int_15_daily.reset_index()
int_15_daily.describe()

Unnamed: 0,daily_login
count,103.0
mean,904.291262
std,347.167463
min,112.0
25%,643.0
50%,827.0
75%,1141.0
max,1889.0


In [6]:
fig= px.line(int_15_daily, x= 'login_time', y= 'daily_login')
fig.update_layout(title= 'Daily Logins Registered', xaxis_title= 'Days', yaxis_title= 'Logins')

The line plot shows daily logins rising over the observed period until April 13, when they drop sharply to 395 logins; the day before registered 1,409. The variance also increases over time.

Next, I will look at the daily mean and standard deviation of the 15-minute counts, and then at weekly totals, to see if there is any pattern.

In [23]:
# Select the count column explicitly so each aggregation returns a Series
daily = intervals_15.resample('D', on='login_time')['15_min_interval']
int_15_daily = daily.sum().to_frame('daily_login')
int_15_daily['mean'] = daily.mean()  # mean logins per 15-minute interval, per day
int_15_daily['std'] = daily.std()    # spread of the 15-minute counts within each day
int_15_daily = int_15_daily.reset_index()
int_15_daily.head()

Unnamed: 0,login_time,daily_login,mean,std
0,1970-01-01,112,7.0,5.291503
1,1970-01-02,681,7.09375,5.074349
2,1970-01-03,793,8.260417,5.75325
3,1970-01-04,788,8.208333,7.240117
4,1970-01-05,459,4.78125,3.745041


In [30]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=int_15_daily['login_time'], y=int_15_daily['mean'],
                    mode='lines',
                    name='mean'))
fig.add_trace(go.Scatter(x=int_15_daily['login_time'], y=int_15_daily['std'],
                    mode='lines+markers',
                    name='Standard Deviation'))
fig.update_layout(title= 'Daily Mean and Standard Deviation of Logins per 15-Minute Interval', xaxis_title= 'Date', yaxis_title= 'Mean and STD')
fig.show()

The graph shows that the daily mean and the standard deviation of the 15-minute counts rise together over the observed period: demand grows, and it also becomes more variable within each day.

In [7]:
int_15_week = intervals_15.resample('W', on= 'login_time').sum()
int_15_week = int_15_week.rename(columns ={'15_min_interval': 'weekly_login'})
int_15_week = int_15_week.reset_index()
int_15_week.head()

Unnamed: 0,login_time,weekly_login
0,1970-01-04,2374
1,1970-01-11,5217
2,1970-01-18,5023
3,1970-01-25,4751
4,1970-02-01,4744


In [8]:
fig2= px.line(int_15_week, x= 'login_time', y= 'weekly_login')
fig2.update_layout(title= 'Weekly logins made', xaxis_title= 'Weeks', yaxis_title= 'Logins')

As before, weekly logins increase over the observed period, up to the week ending March 22 with 8,955 logins registered. They then decrease, down to the week ending April 19 with only 395.

The last period consistently shows the lowest login count; I am going to check whether this is because its data is incomplete.
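One detail helps when reading the weekly table: `resample('W')` labels each bin with the Sunday that ends the week, so a single trailing Monday ends up in a bin labeled with the following Sunday. The sketch below shows this on hypothetical daily totals (a constant 100 per day, chosen only for illustration).

```python
import pandas as pd

# Hypothetical daily totals: a full Mon-Sun week followed by a lone Monday.
days = pd.date_range('1970-04-06', '1970-04-13', freq='D')  # Apr 6, 1970 is a Monday
daily = pd.Series(100, index=days)

# 'W' is week-ending Sunday: bins are closed and labeled on the right edge.
weekly = daily.resample('W').sum()
print(weekly)
```

Here the full week is labeled 1970-04-12, and the lone Monday appears alone under 1970-04-19, which is the same shape as the final row of the weekly totals.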

In [11]:
# Truncate each timestamp to midnight to get a datetime64 calendar-date column
intervals_15['Date'] = intervals_15['login_time'].dt.normalize()
intervals_15.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9788 entries, 0 to 9787
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   login_time       9788 non-null   datetime64[ns]
 1   15_min_interval  9788 non-null   int64         
 2   month            9788 non-null   object        
 3   day              9788 non-null   object        
 4   Date             9788 non-null   datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(2)
memory usage: 382.5+ KB


In [24]:
count = intervals_15.groupby(['Date']).count()
count = count.reset_index()
count.head(3)

Unnamed: 0,Date,login_time,15_min_interval,month,day
0,1970-01-01,16,16,16,16
1,1970-01-02,96,96,96,96
2,1970-01-03,96,96,96,96


In [25]:
count.tail(3)

Unnamed: 0,Date,login_time,15_min_interval,month,day
100,1970-04-11,96,96,96,96
101,1970-04-12,96,96,96,96
102,1970-04-13,76,76,76,76


A full day has 96 fifteen-minute intervals (24 × 4), but January 1 has only 16 entries and April 13 has only 76. Both days are only partially covered, so their daily totals are artificially low.
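The completeness check above can be turned into a filter, so that partial days do not distort daily statistics. This is a minimal sketch on a synthetic 15-minute grid; the names `grid`, `EXPECTED_PER_DAY`, `partial_days`, and `complete` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Synthetic 15-minute grid: starts at 20:00 on day 1 and ends at 18:45 on day 3.
grid = pd.DataFrame({
    'login_time': pd.date_range('1970-01-01 20:00', '1970-01-03 18:45', freq='15min'),
})
grid['count'] = 1  # placeholder login counts

EXPECTED_PER_DAY = 24 * 4  # 96 fifteen-minute intervals in a full day

# Number of intervals observed on each calendar day.
per_day = grid.groupby(grid['login_time'].dt.normalize()).size()
partial_days = per_day[per_day < EXPECTED_PER_DAY]
full_days = per_day[per_day == EXPECTED_PER_DAY].index

# Keep only rows from complete days before computing daily statistics.
complete = grid[grid['login_time'].dt.normalize().isin(full_days)]
print(partial_days)
```

Whether to drop partial days or simply annotate them depends on the question being asked; dropping them is the simpler choice when comparing daily totals.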

## Summary

After loading the logins data, I aggregated it into 15-minute intervals. Every day has 96 intervals except January 1, 1970, with 16, and April 13, 1970, with 76, so the totals for these two partial days are artificially low. On average, the company registers 904 logins per day, with a standard deviation of 347. Logins increase over the observed period, peaking on April 4 with 1,889 logins registered, and the standard deviation increases as well.