# Project 3: Web APIs & NLP -01

## Background

Trip.com Group Limited is a leading global one-stop travel platform, integrating a comprehensive suite of travel products and services and differentiated travel content.  

The Company provides one-stop travel services through Ctrip and Qunar:  
 - Our accommodation business provides over 1.2 million global accommodation offerings, covering hotels, motels, resorts, homes, apartments, bed and breakfasts, hostels, and other properties.  
 - Our air ticketing business offers flights from over 480 airlines, covering over 2,600 airports in over 200 countries and regions.  
 - We also offer over 310,000 in-destination activities around the world.    

The Company provides travel services to non-Chinese customers mainly through Trip.com and Skyscanner.

The company's vision is "to be the world's leading and most trusted family of online travel brands that aspires to deliver the perfect trip at the best price for every traveler."

The goal is to penetrate the Americas, especially North America. We plan to become more visible on search engines, so the marketing team has started keyword research. Keyword research is the process of finding and analyzing search terms that people enter into search engines with the goal of using that data for a specific purpose, often for search engine optimization (SEO) or general marketing. Keyword research can uncover queries to target, the popularity of these queries, their ranking difficulty, and more. To do this, we need to identify keywords, especially long-tail keywords, that can reach customers.

In this project, we pull about 5,000 entries each from r/Flights and r/Hotels on Reddit to build a text classification model that helps the marketing team identify which words belong to the flight domain and which belong to the hotel domain. The model will also surface the words with the highest coefficients, which may help the marketing team identify long-tail keywords.

Text classification gives you a richer SEO audit of the words that characterize your page content. Exploring word frequency against content in multiple pages will lead to more decisive SEO insights into inserting the words meant to be emphasized in a search query.
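
As a rough illustration of the word-frequency idea, the sketch below counts multi-word phrases across a handful of post titles, which is one simple way to surface long-tail keyword candidates. The sample titles and the helper names `ngrams` and `top_phrases` are hypothetical and not part of this project's code; the classifier built later in the project is a separate step.

```python
from collections import Counter
import re

def ngrams(text, n):
    """Return the n-word phrases found in text (lowercased word tokens)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_phrases(titles, n=3, k=3):
    """Count n-word phrases across all titles and return the k most common."""
    counts = Counter()
    for title in titles:
        counts.update(ngrams(title, n))
    return counts.most_common(k)

# Hypothetical sample titles, for illustration only
sample_titles = [
    "Cheap flights to Rome in July",
    "Looking for cheap flights to Rome",
    "Best time to book cheap flights to Tokyo",
]

print(top_phrases(sample_titles, n=3, k=2))
# [('cheap flights to', 3), ('flights to rome', 2)]
```

Phrases of three or more words that recur across many posts are plausible long-tail candidates, although raw counts say nothing about search volume or ranking difficulty.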

## Problem Statement

Conduct keyword research to uncover queries to target and the popularity of those queries. We use a supervised learning technique, so we need labeled data to train our model.


## Data Collection

In [1]:
#import libraries
import time
import pandas as pd
import numpy as np
import requests 

In [2]:
# Define the base URL for submissions from the Pushshift Reddit API
baseurl = 'https://api.pushshift.io/reddit/search/submission'

### Create Functions

In [3]:
# Define a function to get new parameters for the preceding 500 posts
def get_params(base_df, subreddit):
    params = {
        'subreddit': subreddit, 
        'size': 500, 
        'before': base_df.loc[(base_df.shape[0] - 1), 'created_utc'] 
    }
    return params
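
The `before` parameter acts as a cursor: each request asks for posts older than the last one already collected. The sketch below shows that pattern against an in-memory stand-in, assuming the endpoint returns posts newest first and treats `before` as exclusive. `fake_api` and `pull_all` are hypothetical names used only for this illustration.

```python
def fake_api(posts, size, before=None):
    """Stand-in for the search endpoint: newest-first posts older than `before`."""
    older = [p for p in posts if before is None or p["created_utc"] < before]
    older.sort(key=lambda p: p["created_utc"], reverse=True)
    return older[:size]

def pull_all(posts, size, pages):
    """Walk backwards in time, using the oldest timestamp seen as the next cursor."""
    collected = []
    before = None
    for _ in range(pages):
        page = fake_api(posts, size, before)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        before = page[-1]["created_utc"]  # oldest post on this page
    return collected

# Ten hypothetical posts with increasing timestamps
all_posts = [{"id": i, "created_utc": 1000 + i} for i in range(10)]
result = pull_all(all_posts, size=4, pages=5)
print([p["id"] for p in result])
# [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```

One caveat of an exclusive timestamp cursor is that posts sharing the exact boundary timestamp can be skipped, which may be one reason a pull returns slightly fewer rows than requested.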

In [4]:
# Define a function that returns a list of dictionaries for the content of each post
def get_posts(params, baseurl='https://api.pushshift.io/reddit/search/submission'):
    res = requests.get(baseurl, params=params)
    # Raise on failure: returning an error string would be passed on to
    # create_new_df and fail there with a less helpful message
    if res.status_code != 200:
        raise RuntimeError(f'Error! Status code: {res.status_code}')
    return res.json()['data']
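
Repeated pulls against a public API can hit rate limits or transient errors. The sketch below is one possible way to wrap a fetch function with retries and a growing pause, assuming the fetch function raises `RuntimeError` on a failed request. `fetch_with_retry` and `flaky_fetch` are hypothetical helpers, not part of the notebook.

```python
import time

def fetch_with_retry(fetch, params, retries=3, wait=1.0, sleep=time.sleep):
    """Call fetch(params); on RuntimeError pause and retry, doubling the pause."""
    for attempt in range(retries):
        try:
            return fetch(params)
        except RuntimeError:
            if attempt == retries - 1:
                raise  # out of attempts, surface the error
            sleep(wait * 2 ** attempt)

# Hypothetical fetcher that fails twice before succeeding
calls = {"n": 0}

def flaky_fetch(params):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Error! Status code: 429")
    return [{"title": "example", "subreddit": params["subreddit"]}]

waits = []  # record the pauses instead of actually sleeping
posts = fetch_with_retry(flaky_fetch, {"subreddit": "flights"}, sleep=waits.append)
print(len(posts), waits)
# 1 [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the example fast to run; in a real pull the default `time.sleep` would be used.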

In [5]:
# Define a function to turn the list of posts into a DataFrame
def create_new_df(posts):
    return pd.DataFrame(posts)

In [6]:
# Define a function to update the base DataFrame with the next 500 older posts
def update_df(base_df, subreddit):
    params = get_params(base_df, subreddit)
    df2 = create_new_df(get_posts(params))
    # Append the new page; sort=True sorts the columns when the two frames' columns differ
    updated = pd.concat([base_df, df2], axis=0, ignore_index=True, sort=True)
    return updated

### Collecting Flights data:

In [7]:
# Set up url parameters for the first pull from the Flights subreddit (first 500 posts)
params_flights = {
    'subreddit': 'flights', 
    'size': 500
}

In [8]:
# Create a base dataframe from the posts
df_flights = create_new_df(get_posts(params_flights))

In [9]:
# Look at the shape (rows, columns)
df_flights.shape

(499, 93)

In [10]:
df_flights.head()

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,retrieved_utc,updated_utc,utc_datetime_str,post_hint,url_overridden_by_dest,preview,media_metadata,poll_data,is_gallery,gallery_data
0,Flights,"To date, I just purchased my cheapest internat...",t2_jn6mtl1y,0,$9AUD Flight,[],r/Flights,False,6,,...,1678772394,1678772395,2023-03-14 05:39:36,,,,,,,
1,Flights,,t2_9wpssbnd,0,I'm an engineer with a design to reverse engin...,[],r/Flights,False,6,images,...,1678761438,1678761439,2023-03-14 02:37:06,image,https://i.redd.it/8p6p65e2amna1.png,{'images': [{'source': {'url': 'https://previe...,,,,
2,Flights,[removed],t2_t7tp8zm5,0,TAP AIR FLIGHT SONG,[],r/Flights,False,6,,...,1678756708,1678756709,2023-03-14 01:18:15,,,,,,,
3,Flights,Context: I got a flight centre gift card for $...,t2_3kl6aczr,0,what would you do? awful rule regarding Porter...,[],r/Flights,False,6,,...,1678753915,1678753916,2023-03-14 00:31:38,,,,,,,
4,Flights,[removed],t2_4rayi7px,0,EVA Air CC Verification,[],r/Flights,False,6,,...,1678749495,1678749496,2023-03-13 23:17:59,,,,,,,


In [11]:
df_flights[['subreddit', 'selftext', 'title', 'created_utc']].head()

Unnamed: 0,subreddit,selftext,title,created_utc
0,Flights,"To date, I just purchased my cheapest internat...",$9AUD Flight,1678772376
1,Flights,,I'm an engineer with a design to reverse engin...,1678761426
2,Flights,[removed],TAP AIR FLIGHT SONG,1678756695
3,Flights,Context: I got a flight centre gift card for $...,what would you do? awful rule regarding Porter...,1678753898
4,Flights,[removed],EVA Air CC Verification,1678749479


In [12]:
# Update the Flights dataframe with up to 4,500 older posts (9 further pulls of 500)
for i in range(9):
    df_flights = update_df(df_flights, 'flights')
    
df_flights.shape

(4996, 116)

In [13]:
df_flights[['subreddit', 'selftext', 'title', 'created_utc']].tail()

Unnamed: 0,subreddit,selftext,title,created_utc
4991,Flights,I just booked our flights and I realized I acc...,Do children need ID to travel domestically?,1491165745
4992,Flights,,UÇUŞ RÖTARINDA TAZMİNAT ALMAK,1491157678
4993,Flights,"Hi, just found this subreddit as I'm looking f...",Help figuring out a flight to Bulgaria,1491068847
4994,Flights,[deleted],Is this a bait and switch by Southwest?,1491021573
4995,Flights,"Hey all,\n\nMy friend and I are travelling fro...","Flying to Rome, transferring flight in London....",1490991509
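
The `created_utc` values are Unix epoch seconds, so converting the first and last values shows the period the pull covers. A minimal sketch using only the standard library follows; `to_utc_string` is a hypothetical helper name, and the two timestamps are taken from the `head()` and `tail()` outputs above.

```python
from datetime import datetime, timezone

def to_utc_string(epoch_seconds):
    """Format Unix epoch seconds as a UTC date-time string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")

newest = to_utc_string(1678772376)  # first row of df_flights
oldest = to_utc_string(1490991509)  # last row of df_flights
print(newest)
# 2023-03-14 05:39:36
print(oldest)
```

The newest value matches the `utc_datetime_str` column shown for the same post, which is a quick check that the conversion is applied correctly.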


### Collecting Hotels data:

In [14]:
# Set up url parameters for the first pull from the Hotels subreddit (first 500 posts)
params_hotels = {
    'subreddit': 'hotels', 
    'size': 500
}

In [15]:
# Create a base dataframe from the posts
df_hotels = create_new_df(get_posts(params_hotels))

In [16]:
# Look at the shape (rows, columns)
df_hotels.shape

(500, 92)

In [17]:
df_hotels.head()

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,author_cakeday,crosspost_parent_list,url_overridden_by_dest,crosspost_parent
0,hotels,I made a huge mistake booking a hotel for an u...,t2_15hx7g,0,Double booking + egregious hotel cancellation ...,[],r/hotels,False,6,,...,0,,False,1678763021,1678763022,2023-03-14 03:03:29,,,,
1,hotels,Hello All! \nI am currently applying for assis...,t2_a1jx5edk,0,Hotel Managment,[],r/hotels,False,6,,...,0,,False,1678760137,1678760138,2023-03-14 02:15:21,,,,
2,hotels,[removed],t2_umhk6tre,0,Parking pass,[],r/hotels,False,6,,...,0,,False,1678759338,1678759338,2023-03-14 02:02:06,,,,
3,hotels,[removed],t2_tggblfic,0,A Journey Through The Oceana Hotel Culinary Wo...,[],r/hotels,False,6,,...,0,,False,1678731758,1678731758,2023-03-13 18:22:24,,,,
4,hotels,[removed],t2_63yclmjaj,0,Chime Card at Las Vegas Hotels,[],r/hotels,False,6,,...,0,,False,1678730650,1678730651,2023-03-13 18:03:52,,,,


In [18]:
df_hotels[['subreddit', 'selftext', 'title', 'created_utc']].head()

Unnamed: 0,subreddit,selftext,title,created_utc
0,hotels,I made a huge mistake booking a hotel for an u...,Double booking + egregious hotel cancellation ...,1678763009
1,hotels,Hello All! \nI am currently applying for assis...,Hotel Managment,1678760121
2,hotels,[removed],Parking pass,1678759326
3,hotels,[removed],A Journey Through The Oceana Hotel Culinary Wo...,1678731744
4,hotels,[removed],Chime Card at Las Vegas Hotels,1678730632


In [19]:
# Update the Hotels dataframe with up to 4,500 older posts (9 further pulls of 500)
for i in range(9):
    df_hotels = update_df(df_hotels, 'hotels')
    
df_hotels.shape

(5000, 115)

In [20]:
df_hotels[['subreddit', 'selftext', 'title', 'created_utc']].tail()

Unnamed: 0,subreddit,selftext,title,created_utc
4995,hotels,,GruppenreisenUK,1408427884
4996,hotels,,Choose Family Hotels in Dubai for a Lovely Vac...,1408426969
4997,hotels,,Five must-see European castles,1408426560
4998,hotels,,Five must-see European castles,1408425765
4999,hotels,,"Doubletree Hotel in Port Huron, Michigan",1408392966


### Export data

In [21]:
#export flights dataframe to CSV
df_flights = df_flights[['subreddit', 'selftext', 'title', 'created_utc']]
df_flights.to_csv('../data/flights_subs.csv', index=False)

In [22]:
#export hotels dataframe to CSV
df_hotels = df_hotels[['subreddit', 'selftext', 'title', 'created_utc']]
df_hotels.to_csv('../data/hotels_subs.csv', index=False)
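
Many rows in the outputs above have an empty `selftext` or a placeholder such as `[removed]` or `[deleted]`. The sketch below shows one way to tally how much usable body text a pull contains before moving on to cleaning. The helper `selftext_status` is hypothetical, and the sample rows are abbreviated from the outputs above for illustration.

```python
from collections import Counter

# Values that carry no usable body text
PLACEHOLDERS = {"", "[removed]", "[deleted]"}

def selftext_status(rows):
    """Tally rows whose selftext is usable versus empty or a placeholder."""
    tally = Counter()
    for row in rows:
        text = (row.get("selftext") or "").strip()
        tally["placeholder" if text in PLACEHOLDERS else "usable"] += 1
    return tally

# Abbreviated sample rows, for illustration only
rows = [
    {"title": "TAP AIR FLIGHT SONG", "selftext": "[removed]"},
    {"title": "Is this a bait and switch by Southwest?", "selftext": "[deleted]"},
    {"title": "Five must-see European castles", "selftext": None},
    {"title": "Help figuring out a flight to Bulgaria", "selftext": "Hi, just found this subreddit"},
]

status = selftext_status(rows)
print(status["usable"], status["placeholder"])
# 1 3
```

Rows without usable body text still have a title, so they need not be dropped outright; that decision belongs to the cleaning step.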

In [2]:
# import the modelling and plotting utilities used below
from matplotlib import pyplot
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# NOTE: X (feature matrix) and y (binary labels) are not created in this notebook;
# they must be defined before this cell can run
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# fit a model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
# predict probabilities
lr_probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# predict class values
yhat = model.predict(testX)
lr_precision, lr_recall, _ = precision_recall_curve(testy, lr_probs)
lr_f1, lr_auc = f1_score(testy, yhat), auc(lr_recall, lr_precision)
# summarize scores
print('Logistic: f1=%.3f auc=%.3f' % (lr_f1, lr_auc))
# plot the precision-recall curves
no_skill = len(testy[testy==1]) / len(testy)
pyplot.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
pyplot.plot(lr_recall, lr_precision, marker='.', label='Logistic')
# axis labels
pyplot.xlabel('Recall')
pyplot.ylabel('Precision')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()

## Continue to Notebook 2: Data Cleaning & EDA