# Select a subset of participants

Data on participants consists of the following:

### Ranges
* Age
* Name
* Year Loan Taken

### Categories
* Ethnicity
* Income
* Employment
* Education
* State

### Binary
* Completed Degree
* Hispanic/Latino
* Currently Have Student Loan
* Primary Person Making Repayment
* Loan Balance <$32K
* Ever Missed a Payment
* Having Difficulty Making Payments

Some data is missing from the original PDF tables from which this spreadsheet was created. Missing values are recorded as "?".
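Because "?" is an ordinary string, pandas will not treat it as missing on its own. The sketch below shows one way to convert the placeholder to `NaN` before analysis. The toy dataframe and its values are made up for illustration; they stand in for `participants_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for participants_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'PID': [1, 2, 3],
    'Age': [31, '?', 37],
    'Income': ['$50-99K', '<$30K', '?'],
})

# Replace the "?" placeholder with NaN so pandas treats it as missing.
cleaned_df = toy_df.replace('?', np.nan)

# With the placeholder gone, Age can be converted to a numeric column.
cleaned_df['Age'] = pd.to_numeric(cleaned_df['Age'])

# Count missing values per column.
missing_counts = cleaned_df.isna().sum()
print(missing_counts)
```

Converting `Age` to a numeric type matters for range filters, which otherwise see a column of mixed strings and numbers.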

## Imports and Data

In [1]:
import numpy as np
import pandas as pd
import qgrid

# The participants sheet has no header row, so column names are supplied explicitly.
participants_df = pd.read_excel('Participants.xlsx', header=None, names=[
    'Name', 'PID', 'State', 'Gender', 'Age', 'Education', 'Completed Degree', 'Employment', 'Income', 'Ethnicity',
    'Hispanic/Latino', 'Currently Have Student Loan', 'Primary Person Making Repayment',
    'Loan balance <$32K', 'Year Loan Taken', 'Ever Missed a Payment', 'Having Difficulty Making Payments'])

answers_df = pd.read_excel('Answers_Edited_530.xlsx', sheet_name='Answers',
                           names=['PID', 'TQID', 'AID', 'Answer', 'Code'])

## QGrid for Filtering Data by Participants

Use the table below to select a subset of participants for subsequent analysis. Click the filter icon in a column's header to select a range or specific values for that column.

In [2]:
participants_qgrid = qgrid.show_grid(participants_df, show_toolbar=True)
participants_qgrid

QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': True, 'defau…
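The grid filters are interactive, so they are not recorded in the notebook. When a selection needs to be reproducible, the same kind of filter can be written directly in pandas. This is a minimal sketch on a made-up dataframe; the column names mirror `participants_df`, but the rows and the chosen thresholds are assumptions.

```python
import pandas as pd

# Toy stand-in for participants_df; rows are made up for illustration.
toy_participants_df = pd.DataFrame({
    'PID': [1, 2, 3, 4],
    'State': ['california', 'california', 'texas', 'california'],
    'Age': [31, 35, 37, 23],
    'Employment': ['Full Time', 'Part Time', 'Full Time', 'Part Time'],
})

# Range filter, comparable to setting a min/max on a numeric column.
age_mask = toy_participants_df['Age'].between(30, 40)

# Category filter, comparable to ticking values in a column's filter list.
employment_mask = toy_participants_df['Employment'].isin(['Full Time'])

# Combine the masks with & to keep rows that satisfy every filter.
scripted_filtered_df = toy_participants_df[age_mask & employment_mask]
print(scripted_filtered_df['PID'].tolist())  # [1, 3]
```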

## Generate Dataframe Filtered by Participant

The cell below generates a new dataframe from the filters selected in the grid above. Its PIDs are then used to pick out the rows of the answers data (`answers_df`) that belong to the selected participants.

In [3]:
participants_filtered_df = participants_qgrid.get_changed_df()
participants_filtered_df

Unnamed: 0,Name,PID,State,Gender,Age,Education,Completed Degree,Employment,Income,Ethnicity,Hispanic/Latino,Currently Have Student Loan,Primary Person Making Repayment,Loan balance <$32K,Year Loan Taken,Ever Missed a Payment,Having Difficulty Making Payments
0,Aliana,1,california,F,31,technical degree,yes,Full Time,$50-99K,-,Yes,Yes,Yes,No,2012,Yes,Yes
1,Arianne,2,california,F,35,some graduate school,yes,Part Time,$50-99K,Caucasian,No,Yes,Yes,No,2009,Yes,Yes
2,Brandon,3,california,M,37,graduate degree,yes,Full Time,$100-149K,Caucasian,No,Yes,Yes,No,2008 2009,Yes,Yes
3,Bryan,4,california,M,37,college graduate,yes,Full Time,$50-99K,Caucasian,No,Yes,Yes,No,2012,No,No
4,Christine,5,california,F,31,technical degree,yes,Unemployed,<$30K,African American,No,Yes,Yes,No,2012 2013,Yes,Yes
5,Dana,6,california,F,23,college graduate,yes,Part Time,<$30K,African American,No,Yes,Yes,Yes,2013,Yes,Yes
6,Dara,7,california,F,34,associate degree,yes,Full Time,$30-49K,African American,No,Yes,Yes,Yes,2006,Yes,Yes
7,Paul,8,california,M,32,college graduate,yes,Full Time,$30-49K,Asian,No,Yes,Yes,Yes,2008 2015,No,Yes
8,Jonathan,9,california,M,30,college graduate,yes,Full Time,$30-49K,Caucasian,No,Yes,Yes,Yes,2013 2014,Yes,Yes
9,Paz,10,california,M,40,associate degree,no,Full Time,$50-99K,-,Yes,Yes,Yes,Yes,2009,No,No


Create an array `qgrid_pids` that lists the selected PIDs. It is then used to locate the matching rows in the answers data.

In [4]:
qgrid_pids = participants_filtered_df['PID'].values

In [5]:
specific_participants_answers_df = answers_df.loc[answers_df['PID'].isin(qgrid_pids)]

## QGrid for Filtering Answers Data by TQID, Code

This grid applies a second round of filters, this time to the answers data. The previous steps narrowed the data to the selected participants. Within that subset, we can now filter by TQID, Code, or a combination of the two to choose the answers we want summary statistics and visualizations for.

In [6]:
answers_qgrid = qgrid.show_grid(specific_participants_answers_df, show_toolbar=True)
answers_qgrid

QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': True, 'defau…

## Generate Dataframe Filtered by TQID and Code

The following cell generates a dataframe from the filters applied to the TQID and Code fields.

In [7]:
answers_filtered_df = answers_qgrid.get_changed_df()
answers_filtered_df

Unnamed: 0,PID,TQID,AID,Answer,Code
0,1,T1Q1,1,Well it looks like a response,expectation of content
1,1,T1Q1,2,"obviously to the loan from Argonaut,",expectation of content
2,1,T1Q1,3,"that's who I spoke with, so ...",reason for choice
3,1,T1Q1,4,"It's explaining the different payment plans,",content
4,1,T1Q1,5,but it's not really helping me.,lack of understanding
5,1,T1Q1,6,At the bottom it says I can have the deferment...,informed of options
6,1,T1Q1,7,", that would be more useful to me",preference for options
7,1,T1Q1,8,. I guess I can go to Argonaut.com,action: go to website
8,1,T1Q1,9,to change the payment plan or postpone it.,looking at options
9,1,T1Q2,1,I would check to see how long I have,looking at options


TODO: Make a cell which displays the filters that have been applied for double checking
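One possible approach to the TODO above, sketched in plain pandas: compare the filtered dataframe against the unfiltered one and report which columns lost values. The helper `summarize_applied_filters` is hypothetical, not part of qgrid. It infers the effect of the filters from the data rather than reading the grid's filter settings, so a filter on one column can also show up as a narrowing of another.

```python
import pandas as pd

def summarize_applied_filters(original_df, filtered_df, max_values=10):
    """Return {column: remaining values} for columns that were narrowed."""
    summary = {}
    for column in original_df.columns:
        original_values = set(original_df[column].dropna().unique())
        kept_values = set(filtered_df[column].dropna().unique())
        if kept_values == original_values:
            continue  # This column shows no sign of filtering.
        if len(kept_values) <= max_values:
            summary[column] = sorted(kept_values, key=str)
        else:
            summary[column] = f'{len(kept_values)} of {len(original_values)} values kept'
    return summary

# Toy stand-in for the answers data; rows are made up for illustration.
toy_answers_df = pd.DataFrame({
    'PID': [1, 1, 2, 2],
    'TQID': ['T1Q1', 'T1Q2', 'T1Q1', 'T1Q2'],
    'Code': ['content', 'looking at options', 'content', 'content'],
})
toy_filtered_df = toy_answers_df[toy_answers_df['TQID'] == 'T1Q1']

filter_summary = summarize_applied_filters(toy_answers_df, toy_filtered_df)
print(filter_summary)
```

In the notebook, the call would be along the lines of `summarize_applied_filters(specific_participants_answers_df, answers_filtered_df)`.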

In [8]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/tempadmin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [10]:
answers_filtered_df['Answer'] = answers_filtered_df['Answer'].astype(str)

In [11]:
# Tokenize each answer into a list of words and punctuation marks.
answers_filtered_df['tokens'] = answers_filtered_df['Answer'].apply(nltk.word_tokenize)

In [12]:
answers_filtered_df.head()

Unnamed: 0,PID,TQID,AID,Answer,Code,tokens
0,1,T1Q1,1,Well it looks like a response,expectation of content,"[Well, it, looks, like, a, response]"
1,1,T1Q1,2,"obviously to the loan from Argonaut,",expectation of content,"[obviously, to, the, loan, from, Argonaut, ,]"
2,1,T1Q1,3,"that's who I spoke with, so ...",reason for choice,"[that, 's, who, I, spoke, with, ,, so, ...]"
3,1,T1Q1,4,"It's explaining the different payment plans,",content,"[It, 's, explaining, the, different, payment, ..."
4,1,T1Q1,5,but it's not really helping me.,lack of understanding,"[but, it, 's, not, really, helping, me, .]"


In [13]:
from nltk.probability import FreqDist
from nltk.corpus import stopwords


# Flatten the per-answer token lists into one list, then count token frequencies.
answers_list = list(answers_filtered_df['tokens'])
flat_answers_list = [item for sublist in answers_list for item in sublist]
answers_fd = FreqDist(flat_answers_list)
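The cell above imports `stopwords` but does not use it yet. A frequency count over raw tokens tends to be dominated by function words and punctuation, so the sketch below shows how such tokens might be filtered out first. It uses `collections.Counter` (which `FreqDist` subclasses) and a small hand-written stop list so that it runs without NLTK data; both the stop list and the lowercasing step are assumptions, not part of the original analysis.

```python
from collections import Counter

# Tokens taken from the first answers shown earlier in this notebook.
tokens = ['Well', 'it', 'looks', 'like', 'a', 'response', 'obviously', 'to',
          'the', 'loan', 'from', 'Argonaut', ',', 'but', 'it', "'s", 'not',
          'really', 'helping', 'me', '.']

# Stand-in stop list (assumption). In the notebook this could instead be
# set(stopwords.words('english')) after nltk.download('stopwords').
stop_words = {'it', 'a', 'to', 'the', 'from', 'but', 'not', 'me'}

# Keep alphabetic tokens that are not stop words, compared in lowercase.
content_tokens = [t.lower() for t in tokens
                  if t.isalpha() and t.lower() not in stop_words]

content_counts = Counter(content_tokens)
print(content_tokens)
```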

In [14]:
answer_bigrams = nltk.bigrams(flat_answers_list)

TODO: Make the below code count bigrams rather than simply list them all

In [15]:
list(answer_bigrams)

[('Well', 'it'),
 ('it', 'looks'),
 ('looks', 'like'),
 ('like', 'a'),
 ('a', 'response'),
 ('response', 'obviously'),
 ('obviously', 'to'),
 ('to', 'the'),
 ('the', 'loan'),
 ('loan', 'from'),
 ('from', 'Argonaut'),
 ('Argonaut', ','),
 (',', 'that'),
 ('that', "'s"),
 ("'s", 'who'),
 ('who', 'I'),
 ('I', 'spoke'),
 ('spoke', 'with'),
 ('with', ','),
 (',', 'so'),
 ('so', '...'),
 ('...', 'It'),
 ('It', "'s"),
 ("'s", 'explaining'),
 ('explaining', 'the'),
 ('the', 'different'),
 ('different', 'payment'),
 ('payment', 'plans'),
 ('plans', ','),
 (',', 'but'),
 ('but', 'it'),
 ('it', "'s"),
 ("'s", 'not'),
 ('not', 'really'),
 ('really', 'helping'),
 ('helping', 'me'),
 ('me', '.'),
 ('.', 'At'),
 ('At', 'the'),
 ('the', 'bottom'),
 ('bottom', 'it'),
 ('it', 'says'),
 ('says', 'I'),
 ('I', 'can'),
 ('can', 'have'),
 ('have', 'the'),
 ('the', 'deferment'),
 ('deferment', 'I'),
 ('I', 'guess'),
 ('guess', 'or'),
 ('or', 'the'),
 ('the', 'forbearance'),
 ('forbearance', ','),
 (',', 'that
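For the TODO about counting bigrams rather than listing them, a minimal sketch follows. It pairs each token with its successor using `zip` and tallies the pairs with `collections.Counter`, so it runs without NLTK. Passing `nltk.bigrams(flat_answers_list)` to `FreqDist` would give a comparable count. Lowercasing the tokens first is an assumption, made here so that `It` and `it` are counted together.

```python
from collections import Counter

# Tokens taken from two of the answers shown earlier in this notebook.
tokens = ['It', "'s", 'explaining', 'the', 'different', 'payment', 'plans', ',',
          'but', 'it', "'s", 'not', 'really', 'helping', 'me', '.']

# Lowercase so that 'It' and 'it' form the same bigram (assumption).
lowered = [t.lower() for t in tokens]

# Pair each token with the one after it and count the pairs.
bigram_counts = Counter(zip(lowered, lowered[1:]))
print(bigram_counts.most_common(1))  # [(('it', "'s"), 2)]
```

Note that `nltk.bigrams` returns a generator, so calling `list()` on it consumes it; it has to be recreated before it can be counted. Counting over the flattened token list also produces bigrams that span two separate answers, such as `('response', 'obviously')` in the output above.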

TODO: Make the charts below display relevant information and give some interactive options for the user

In [16]:
%matplotlib nbagg
import matplotlib.pyplot as plt

answers_filtered_df = answers_qgrid.get_changed_df()
x = answers_filtered_df.index
y = answers_filtered_df['PID']

fig, ax = plt.subplots()
scatter, = ax.plot(x, y, ms=8, color='b', marker='o', ls='')

def handle_filter_changed(event, widget):
    # Re-read the filtered rows and update the existing scatter in place.
    filtered_df = answers_qgrid.get_changed_df()
    scatter.set_data(filtered_df.index, filtered_df['PID'])
    # Rescale the axes to the new data, then redraw once.
    ax.relim()
    ax.autoscale_view()
    fig.canvas.draw()

# Redraw the plot whenever the grid's filters change.
answers_qgrid.on('filter_changed', handle_filter_changed)

<IPython.core.display.Javascript object>

In [17]:
answers_qgrid

QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': True, 'defau…
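The scatter plot above shows PID against row index, which says little about the answers themselves. As one possible direction for the charts TODO, the sketch below computes how often each code occurs and how codes are distributed across questions. The toy dataframe is made up for illustration and stands in for `answers_filtered_df`.

```python
import pandas as pd

# Toy stand-in for answers_filtered_df; rows are made up for illustration.
toy_answers_df = pd.DataFrame({
    'PID': [1, 1, 1, 2, 2],
    'TQID': ['T1Q1', 'T1Q1', 'T1Q2', 'T1Q1', 'T1Q2'],
    'Code': ['content', 'looking at options', 'looking at options',
             'content', 'looking at options'],
})

# Number of answer segments per code, most frequent first.
code_counts = toy_answers_df['Code'].value_counts()

# Codes broken down by question: rows are TQIDs, columns are codes.
codes_by_question = pd.crosstab(toy_answers_df['TQID'], toy_answers_df['Code'])

print(code_counts)
print(codes_by_question)
```

In a notebook with matplotlib available, `code_counts.plot.barh()` would draw these counts as a horizontal bar chart, and the computation could be rerun inside a `filter_changed` handler to keep the chart in step with the grid.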