## Using Scattertext to Analyze PyData Talks
Let's pull the titles, abstracts, and descriptions of PyData talks to see how novice-level talks differ from intermediate and advanced ones.

Please check out Scattertext on GitHub: https://github.com/JasonKessler/scattertext

In [2]:
import re, time

import pandas as pd
import requests
from bs4 import BeautifulSoup
import pygal
import seaborn as sns
import spacy
import scattertext as st
from IPython.display import IFrame, display, HTML

# Widen the notebook container so the Scattertext plot fits.
display(HTML("<style>.container { width:98% !important; }</style>"))
%matplotlib inline

## First, let's scrape pydata.org

In [23]:
sched.to_csv('pydata_talks.csv', index=False)

In [5]:
sched = pd.read_csv('pydata_talks.csv')

In [7]:
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut was removed in spaCy 3

In [8]:
sched = sched[~sched['title'].isin(['BoF', 'Unconference Presentation'])]

In [9]:
sched['is_novice'] = (sched.level == 'Novice').apply(lambda x: 'Novice' if x else 'Not Novice')

In [10]:
sched['parse'] = (sched['title'] + '\n \n' + sched['abstract'].fillna('') + '\n \n' + sched['description'].fillna('')).apply(nlp)

In [11]:
sched = sched.loc[sched['title'].drop_duplicates().index]
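
The filtering, labeling, and de-duplication steps above can be checked on a small toy frame. This is a minimal sketch: the rows in `toy` are made up for illustration and are not taken from the scraped schedule.

```python
import pandas as pd

# Toy stand-in for the scraped schedule; these rows are invented for illustration.
toy = pd.DataFrame({
    'title': ['Intro to Pandas', 'BoF', 'Intro to Pandas', 'Dask Internals'],
    'level': ['Novice', 'Novice', 'Novice', 'Experienced'],
})

# Drop placeholder sessions, as in the notebook (copy so we can add columns safely).
toy = toy[~toy['title'].isin(['BoF', 'Unconference Presentation'])].copy()

# Binary label: 'Novice' vs. everything else.
toy['is_novice'] = (toy['level'] == 'Novice').map({True: 'Novice', False: 'Not Novice'})

# Keep only the first row for each title.
deduped = toy.loc[toy['title'].drop_duplicates().index]
print(deduped)
```

The repeated "Intro to Pandas" row and the "BoF" row are dropped, leaving one row per distinct talk title.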

In [12]:
sched

| Unnamed: 0 | abstract | author | description | level | location | title | year | is_novice | parse |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Topics to be covered include ...\n\nCognitive ... | Dave DeBarr | We will review tutorial examples of using CNTK... | Intermediate | seattle | Using CNTK's Python Interface for Deep Learning | 2017 | Not Novice | (Using, CNTK, 's, Python, Interface, for, Deep... |
| 1 | Jupyter is great for tinkering and research. B... | Pavlo Andriychenko | I will show the tools and processes for buildi... | Experienced | london | Make your research interactive with Jupyter Da... | 2017 | Not Novice | (Make, your, research, interactive, with, Jupy... |
| 2 | What will we do in the workshop\n\nReading CSV... | Eduard Goma | Introductory workshop to show the first steps ... | Novice | barcelona | Introduction to data analysis with Pandas | 2017 | Novice | (Introduction, to, data, analysis, with, Panda... |
| 3 | Pandas is the Swiss-Multipurpose Knife for Dat... | Alexander Hendorf | Pandas is the Swiss-Multipurpose Knife for Dat... | Novice | berlin | Introduction to Data-Analysis with Pandas | 2017 | Novice | (Introduction, to, Data, -, Analysis, with, Pa... |
| 4 | The tutorial will introduce users to the core ... | Skipper Seabold | Dask is a relatively new library for parallel ... | Novice | dc | Using Dask for Parallel Computing in Python | 2016 | Novice | (Using, Dask, for, Parallel, Computing, in, Py... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4409 | The robot detection module that I will present... | Eszter Windhager-Pokol | In this talk, I will present the robot detecti... | Intermediate | london | Robot detection in IT environments | 2016 | Not Novice | (Robot, detection, in, IT, environments, \n \n... |
| 4410 | Imagine you are in London and want to travel s... | Nikolai Nowaczyk | Spherical Voronoi diagrams partition the surfa... | Intermediate | london | Spherical Voronoi Diagrams in Python | 2016 | Not Novice | (Spherical, Voronoi, Diagrams, in, Python, \n ... |
| 4411 | Traditionally, risk scoring frameworks are bui... | Natalia Angarita-Jaimes | We will talk about a framework we have develop... | Intermediate | london | Assurance Scoring: Using Machine Learning and ... | 2016 | Not Novice | (Assurance, Scoring, :, Using, Machine, Learni... |
| 4412 | Jupyter Notebooks are code-centric documents i... | Thomas Kluyver | nbconvert is a set of tools to convert Jupyter... | Experienced | london | Customising nbconvert: how to turn Jupyter not... | 2016 | Not Novice | (Customising, nbconvert, :, how, to, turn, Jup... |


## Let's see how descriptions of novice-directed talks sound compared to those directed at more seasoned audiences

In [12]:
# Build a corpus whose categories are the two values of is_novice.
corpus = st.CorpusFromParsedDocuments(sched, category_col='is_novice', parsed_col='parse').build()
html = st.produce_scattertext_explorer(corpus,
                                       category='Novice',
                                       category_name='Novice',
                                       not_category_name='Intermediate or Advanced',
                                       minimum_term_frequency=8,
                                       pmi_threshold_coefficient=10,
                                       width_in_pixels=1000,
                                       term_ranker=st.OncePerDocFrequencyRanker,
                                       use_full_doc=True,
                                       metadata=sched['author'] + ' (' + sched['location'] + ', ' + sched['level'] + ')')
file_name = 'output/PydataNoviceVsNotNovice.html'
# Write the interactive plot to disk (output/ must already exist), then embed it.
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=file_name, width=1200, height=700)
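
The `term_ranker=st.OncePerDocFrequencyRanker` argument above makes a term count at most once per talk, so a single abstract that repeats a word many times cannot dominate the plot. The idea can be sketched in plain Python. This is an illustration of the counting scheme only, not Scattertext's implementation. The helper `once_per_doc_frequencies` and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict

def once_per_doc_frequencies(docs):
    """docs: iterable of (category, list_of_tokens) pairs.

    Returns {category: Counter mapping term -> number of documents containing it}.
    Hypothetical helper for illustration; not part of Scattertext.
    """
    counts = defaultdict(Counter)
    for category, tokens in docs:
        # set() collapses repeats, so each term counts once per document.
        counts[category].update(set(tokens))
    return counts

# Made-up documents for illustration.
docs = [
    ('Novice', ['pandas', 'pandas', 'intro']),
    ('Novice', ['pandas', 'basics']),
    ('Not Novice', ['gpu', 'gpu', 'gpu', 'pandas']),
]
freqs = once_per_doc_frequencies(docs)
print(freqs['Novice']['pandas'])   # 2: appears in two Novice documents
print(freqs['Not Novice']['gpu'])  # 1: three mentions, but only one document
```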

## Let's see how the experienced talk descriptions sound

In [23]:
sched['is_advanced'] = (sched.level == 'Experienced').apply(lambda x: 'Experienced' if x else 'Not Experienced')
# Build a corpus whose categories are the two values of is_advanced.
corpus = st.CorpusFromParsedDocuments(sched, category_col='is_advanced', parsed_col='parse').build()
html = st.produce_scattertext_explorer(corpus,
                                       category='Experienced',
                                       category_name='Experienced',
                                       not_category_name='Not Experienced',
                                       minimum_term_frequency=8,
                                       pmi_filter_thresold=8,  # (sic) the parameter name as spelled in Scattertext
                                       width_in_pixels=1000,
                                       term_ranker=st.OncePerDocFrequencyRanker,
                                       use_full_doc=True,
                                       metadata=sched['author'] + ' (' + sched['location'] + ', ' + sched['level'] + ')')
file_name = 'output/PydataAdvancedVsRest.html'
# Write the interactive plot to disk (output/ must already exist), then embed it.
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=file_name, width=1200, height=700)