## Setup

In [1]:
from pymongo import MongoClient

# Connect to the MongoDB server and select the `big_data` database
db = MongoClient('mongodb://143.215.138.132:27017')['big_data']

## Find Prolific Tweet Writers

Last time: Fetch every tweet, build a counter (a dictionary) locally, and sort it locally.

This time: Have the server do the counting and sorting with an aggregation pipeline, then fetch only the results.
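For contrast, the local approach can be sketched as below. This is a minimal illustration, not necessarily the code used last time: `sample_tweets` and `count_authors_locally` are hypothetical names, and the sample documents are made up so the block runs without a database.

```python
from collections import Counter

# Hypothetical stand-in for documents fetched from the server; only the
# `author_name` field (the one grouped on in this section) matters here.
sample_tweets = [
    {'author_name': 'Alex', 'text': 'first'},
    {'author_name': 'Chris', 'text': 'second'},
    {'author_name': 'Alex', 'text': 'third'},
]

def count_authors_locally(tweets, top_n=1000):
    """Count tweets per author on the client; return the top_n (author, count) pairs."""
    counts = Counter(tweet['author_name'] for tweet in tweets)
    # most_common sorts by count, highest first
    return counts.most_common(top_n)

top_authors = count_authors_locally(sample_tweets)
print(top_authors)  # [('Alex', 2), ('Chris', 1)]
```

Against the live database, the input would be something like `db.tweet_subset.find({}, {'author_name': 1})`, which still transfers one document per tweet to the client. The server-side version transfers only the grouped counts.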

In [2]:
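# Count the tweets written under each author name
group = {'$group': {'_id': '$author_name', 'count': {'$sum': 1}}}

# Highest counts first
sort = {'$sort': {'count': -1}}

# Keep only the top 1000 author names
limit = {'$limit': 1000}

pipeline = [group, sort, limit]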

In [3]:
for element in db.tweet_subset.aggregate(pipeline):
    print(element)

{'_id': '.', 'count': 1289}
{'_id': '511 New York', 'count': 1135}
{'_id': 'ㅤ', 'count': 686}
{'_id': 'SF311 Reports', 'count': 538}
{'_id': 'SONIC Jobs', 'count': 532}
{'_id': '511NY - New Jersey', 'count': 530}
{'_id': 'B', 'count': 522}
{'_id': '511 NYC Area', 'count': 504}
{'_id': '✨', 'count': 417}
{'_id': 'lex', 'count': 375}
{'_id': 'Chris', 'count': 372}
{'_id': 'J', 'count': 364}
{'_id': 'Alex', 'count': 340}
{'_id': '👑', 'count': 336}
{'_id': 'em', 'count': 327}
{'_id': 'Michael', 'count': 323}
{'_id': 'Mike', 'count': 318}
{'_id': 'ash', 'count': 315}
{'_id': 'Jobs at VA', 'count': 314}
{'_id': 'Panera Careers', 'count': 309}
{'_id': 'Trendinalia USA', 'count': 309}
{'_id': 'Kindred Jobs', 'count': 291}
{'_id': 'Sarah', 'count': 291}
{'_id': 'Ryan', 'count': 290}
{'_id': 'Speedway Jobs', 'count': 287}
{'_id': 'Matt', 'count': 285}
{'_id': 'Lauren', 'count': 285}
{'_id': 'Emily', 'count': 273}
{'_id': 'SHC Careers', 'count': 269}
{'_id': 'Jay', 'count': 268}
{'_id': 'TMJ-LAX 

## Regions

1. Divide the US, and with it the dataset, into four regions (NE, SE, NW, SW)

2. Match each tweet's location (`lat`, `lon`) against these regions
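The two steps can be sketched in plain Python as follows. The bounding-box numbers are the ones used in the `$match` stages of this section; the names `REGIONS`, `make_match`, and `regions_for` are hypothetical helpers introduced only for this illustration.

```python
# Bounding boxes as (lat_min, lat_max, lon_min, lon_max), copied from the
# $match stages in this section. All helper names here are hypothetical.
REGIONS = {
    'NE': (36, 50, -99, -69),
    'SE': (25, 36, -99, -69),
    'NW': (36, 50, -125, -99),
    'SW': (25, 36, -125, -99),
}

def make_match(lat_min, lat_max, lon_min, lon_max):
    """Build a $match stage selecting documents inside an inclusive bounding box."""
    return {'$match': {'lat': {'$gte': lat_min, '$lte': lat_max},
                       'lon': {'$gte': lon_min, '$lte': lon_max}}}

def regions_for(lat, lon):
    """Return every region whose inclusive box contains the point."""
    return [name
            for name, (lat_min, lat_max, lon_min, lon_max) in REGIONS.items()
            if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max]

match_stages = {name: make_match(*box) for name, box in REGIONS.items()}

print(regions_for(40.780709, -73.9685415))  # ['NE']
print(regions_for(36, -99))                 # ['NE', 'SE', 'NW', 'SW']
```

Because both ends of every range are inclusive, a point lying exactly on a dividing line (latitude 36 or longitude -99) matches more than one region. The boxes are also plain rectangles, so they can include locations outside the US.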

In [4]:
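# Each region is an inclusive bounding box on latitude and longitude:
# north/south split at lat 36, east/west split at lon -99.
matchNE = {'$match': {'lat': {'$gte': 36, '$lte': 50}, 'lon': {'$gte': -99, '$lte': -69}}}
matchSE = {'$match': {'lat': {'$gte': 25, '$lte': 36}, 'lon': {'$gte': -99, '$lte': -69}}}
matchNW = {'$match': {'lat': {'$gte': 36, '$lte': 50}, 'lon': {'$gte': -125, '$lte': -99}}}
matchSW = {'$match': {'lat': {'$gte': 25, '$lte': 36}, 'lon': {'$gte': -125, '$lte': -99}}}

# Cap the number of tweets returned
limit = {'$limit': 1000}

# Select tweets located in the northeast box
pipeline = [matchNE, limit]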

for tweet in db.tweet.aggregate(pipeline):
    print(tweet['lat'], tweet['lon'])
    print(tweet['text'])
    print()

42.9564115 -85.6411415
@patrykbiel then I suppose you're a keeper ❤️

41.7644 -71.9548
Can this time, more melony flavor, less hints of caramel - Drinking a Sap @ Club House - https://t.co/7FdOynOqgO #photo

38.898602999999994 -77.0143985
@vmartelll I'm actually crying right now 😭😂😭😭 https://t.co/T7ksfGfwPo

43.555244 -79.616073
"Did they know something we didn't" holy cow what a 🔥take this is... https://t.co/6Eq0sxu29s

44.6891045 -73.47540000000001
Just followed a truck all the way from Albany to Plattsburgh and now I'm bummed we are going spectate ways😅

38.993539 -76.887208
if I ever leave her they gone kill my family 😫

43.6943782 -79.5585958
Personal Injury Collision | Dixon Rd &amp; Kipling Ave [HP] 10/21 18:10 #The_Westway #Toronto

40.780709 -73.9685415
Ladd will pot his first tonight... #isles

43.629311 -79.2725695
@FreeTown30 @theB0SNIAN @TheRealDougler this was the case last time. Shit is surreal.

43.6567856 -79.4550575
Unknown Trouble | Bloor St W &amp; Edna Ave [11 Div.

## Basic Features of Linguistic Style

1. Bag of Words Model (Word Count)

2. Text Length

3. Stance Markers

...
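As a rough local illustration of the first three features, the sketch below computes them for a single text. Everything in it is an assumption made for the example: the tokenizer is a simple regular expression and is not necessarily how the `words` field in the database was produced, and `STANCE_MARKERS` is a tiny hypothetical word list, not an established lexicon.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stance-marker list, for illustration only.
STANCE_MARKERS = {'actually', 'really', 'maybe', 'suppose', 'definitely'}

def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def style_features(text):
    """Compute simple style features for one text."""
    tokens = tokenize(text)
    return {
        'word_counts': Counter(tokens),        # bag of words
        'length_chars': len(text),             # text length in characters
        'length_tokens': len(tokens),          # text length in tokens
        'stance_markers': sum(1 for t in tokens if t in STANCE_MARKERS),
    }

features = style_features("I'm actually crying right now")
print(features['length_tokens'], features['stance_markers'])  # 5 1
```

The same dictionary could be computed per tweet and then aggregated per region, which is what the server-side pipelines in this section do for word counts.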

In [5]:
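# Emit one document per element of each tweet's `words` array
unwind = {'$unwind': '$words'}

# Count how many times each word occurs
group = {'$group': {'_id': '$words', 'count': {'$sum': 1}}}

# Most frequent words first
sort = {'$sort': {'count': -1}}

# Keep only the top 1000 words
limit = {'$limit': 1000}

# Restrict to northeastern tweets before unwinding and counting
pipeline = [matchNE, unwind, group, sort, limit]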
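To make the `$unwind` and `$group` steps concrete, here is a simplified pure-Python emulation on made-up documents. The function names and sample data are hypothetical, and the emulation only covers the case where `words` is an array; it is a sketch of the idea, not of how the server implements these stages.

```python
from collections import Counter

# Hypothetical documents shaped like the ones the pipeline expects:
# each has a `words` array.
docs = [
    {'words': ['the', 'isles', 'the']},
    {'words': ['to', 'the']},
    {'words': []},  # contributes nothing, as with an empty array in $unwind
]

def unwind(documents, field):
    """Yield one copy of each document per element of its array field."""
    for doc in documents:
        for value in doc.get(field, []):
            yield {**doc, field: value}

def word_counts(documents, top_n=1000):
    """Emulate unwind -> group by word -> sort by count descending -> limit."""
    counts = Counter(doc['words'] for doc in unwind(documents, 'words'))
    return [{'_id': word, 'count': n} for word, n in counts.most_common(top_n)]

result = word_counts(docs)
print(result[0])  # {'_id': 'the', 'count': 3}
```

Words with equal counts come back in first-seen order here. The order of ties from a server-side `$sort` on `count` alone should not be relied on.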

In [6]:
for element in db.tweet_subset.aggregate(pipeline):
    print(element)

{'_id': 'the', 'count': 139643}
{'_id': 'to', 'count': 124173}
{'_id': 'i', 'count': 122499}
{'_id': 'a', 'count': 100680}
{'_id': 'you', 'count': 80050}
{'_id': 'in', 'count': 77559}
{'_id': 'and', 'count': 73151}
{'_id': 'for', 'count': 64231}
{'_id': 'is', 'count': 59835}
{'_id': 'my', 'count': 58100}
{'_id': 'this', 'count': 56844}
{'_id': 'of', 'count': 55755}
{'_id': 'on', 'count': 42664}
{'_id': 't', 'count': 38714}
{'_id': 'it', 'count': 38698}
{'_id': 'co', 'count': 38384}
{'_id': 'https', 'count': 37858}
{'_id': 'me', 'count': 37634}
{'_id': 'that', 'count': 35175}
{'_id': 'be', 'count': 34617}
{'_id': 'at', 'count': 33294}
{'_id': 'so', 'count': 30781}
{'_id': "i'm", 'count': 28701}
{'_id': '-', 'count': 27180}
{'_id': 'with', 'count': 27019}
{'_id': 'just', 'count': 25780}
{'_id': 'have', 'count': 23606}
{'_id': 'like', 'count': 21961}
{'_id': 'but', 'count': 21495}
{'_id': 'not', 'count': 21227}
{'_id': 'are', 'count': 20666}
{'_id': 'was', 'count': 19746}
{'_id': 'all', '