# Course Project

## CS 242: Information Retrieval & Web Search
### Winter 2019

## Build a Search Engine

You must work in teams of four. If you cannot find teammates, email the TA (Merlin if you are an on-campus student, Nhat if you are online), who will connect you with other students looking for a team. Teams must be formed by the end of the second week of classes, and their composition emailed to the corresponding TA.

Each project report must have a section called "Collaboration Details" where you should clearly specify the contributions of each member of the team.

### Part A: Collect your data and Index with Lucene

#### A1: You have the following options:

1. Crawl the Web to get Web pages using jsoup (http://jsoup.org/). You may also use Scrapy (https://scrapy.org/) if you prefer Python. You may restrict pages to some category, e.g., edu pages, or pages with at least five images, etc.

2.  Crawl the Web to get images with their captions and names (to be used for indexing in the next parts) using jsoup or Scrapy. Only use smaller images (<200KB) so you don’t stress our Hadoop cluster later.

3.  Use the Twitter Streaming API (https://developer.twitter.com/en/docs/tutorials/consuming-streaming-data.html) to get Tweets. You can also use Tweepy (tweepy.org) if you prefer Python. (Hint: filter to collect only geotagged tweets, so you can then display them on a map in Part B.)

4.  Your own ideas for a dataset are also acceptable, pending instructor approval.
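
The geotag hint in option 3 can be sketched as a small filter applied to each decoded tweet before it is stored. This is a minimal sketch, assuming tweets are dicts in the Twitter v1.1 JSON layout, where `coordinates` is either null or a GeoJSON point holding `[longitude, latitude]`. The function name `extract_lat_lon` and the sample tweets are hypothetical.

```python
from typing import Optional, Tuple


def extract_lat_lon(tweet: dict) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) for a geotagged tweet, else None.

    Assumes the v1.1 tweet layout, where `coordinates` is either null or
    a GeoJSON point: {"type": "Point", "coordinates": [lon, lat]}.
    """
    point = tweet.get("coordinates")
    if not point or point.get("type") != "Point":
        return None
    lon, lat = point["coordinates"]  # GeoJSON order is longitude first
    return (lat, lon)


# Made-up sample tweets, for illustration only.
tweets = [
    {"id": 1, "text": "hello", "coordinates": None},
    {"id": 2, "text": "on campus",
     "coordinates": {"type": "Point", "coordinates": [-117.33, 33.97]}},
]
geotagged = [t for t in tweets if extract_lat_lon(t) is not None]
print([t["id"] for t in geotagged])  # prints [2]
```

Returning latitude first matches what most mapping libraries expect, which may save a swap later in Part B.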

***Collect at least 5 GB of data, but no more than 10 GB.
We recommend using Java, but it is not required.***
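
One way to keep an eye on the 5 GB to 10 GB requirement while collecting is to sum the file sizes under the output directory. The following is a rough sketch only: the helper names `dataset_size_bytes` and `within_limits` are hypothetical, and it assumes 1 GB means 1024^3 bytes.

```python
import os

GB = 1024 ** 3  # assumption: binary gigabytes


def dataset_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files below `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total


def within_limits(size_bytes: int, low: int = 5 * GB, high: int = 10 * GB) -> bool:
    """True when the collected data falls inside the required range."""
    return low <= size_bytes <= high
```

A crawler or streamer could call `dataset_size_bytes` periodically and stop once `high` is approached.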

#### A2: Index your data using Lucene (not Solr)
You will be graded on the correctness and efficiency of your solution (e.g., how does the crawler handle duplicate pages? Is the crawler multi-threaded? How do you store the incoming tweets to maximize throughput?), and the design choices made when using Lucene (e.g., did you remove stop words, and why?  Or did you index hashtags separately from keywords and why?).
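
As one possible answer to the duplicate-pages question, a crawler can track both normalized URLs and a hash of each page body. The sketch below is illustrative, not the required design: `normalize_url` and `DuplicateFilter` are hypothetical names, and the normalization rules (drop the fragment, lowercase the host, treat an empty path as `/`) are assumptions you may want to tighten or relax.

```python
import hashlib
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Drop the fragment, lowercase scheme and host, default the path to '/'."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class DuplicateFilter:
    """Remembers URLs and page bodies that were already seen."""

    def __init__(self):
        self.seen_urls = set()
        self.seen_hashes = set()

    def is_new_url(self, url: str) -> bool:
        key = normalize_url(url)
        if key in self.seen_urls:
            return False
        self.seen_urls.add(key)
        return True

    def is_new_content(self, html: str) -> bool:
        # Collapse whitespace so trivially reformatted copies hash the same.
        canonical = " ".join(html.split())
        digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        return True
```

An exact hash only catches byte-identical bodies after whitespace collapsing; near-duplicate detection would need a different technique. In a multi-threaded crawler, access to the two sets would also need a lock.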




[![Watch the video on YouTube](http://i3.ytimg.com/vi/rhBZqEWsZU4/maxresdefault.jpg)](https://www.youtube.com/watch?v=rhBZqEWsZU4)

In [1]:
from tweepy import API
from tweepy import Cursor
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler 
from tweepy import Stream

import twitter_credentials 

## Twitter Classes 

Client: Extracts all relevant information about a specified Twitter user (the "client").

Streamer: Streams and processes live tweets. It handles authentication and the connection to the Twitter API.

Listener: Basic listener class; receives new data and decides how to handle it.

In [29]:

class TwitterClient():

    # Specify a Twitter user; if None, the API calls default to the authenticated account.
    def __init__(self, twitter_user=None):
        self.auth = TwitterAuthenticator().authenticate_twitter_app()
        self.twitter_client = API(self.auth)
        self.twitter_user = twitter_user

    def get_user_timeline_tweets(self, num_tweets):
        tweets = []
        for tweet in Cursor(self.twitter_client.user_timeline, id=self.twitter_user).items(num_tweets):
            tweets.append(tweet)
        return tweets

    def get_friend_list(self, num_friends):
        friend_list = []
        for friend in Cursor(self.twitter_client.friends, id=self.twitter_user).items(num_friends):
            friend_list.append(friend)
        return friend_list

    def get_home_timeline_tweets(self, num_tweets):
        # home_timeline always returns the authenticated account's timeline,
        # so no user id is passed here.
        home_timeline_tweets = []
        for tweet in Cursor(self.twitter_client.home_timeline).items(num_tweets):
            home_timeline_tweets.append(tweet)
        return home_timeline_tweets
    
    
class TwitterAuthenticator():
    
    def authenticate_twitter_app(self):
        auth = OAuthHandler(twitter_credentials.CONSUMER_KEY, twitter_credentials.CONSUMER_SECRET)
        auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
        return auth



# # # # TWITTER STREAMER # # # #
class TwitterStreamer():
    """
    Class for streaming and processing live tweets.
    """
    def __init__(self):
        self.twitter_authenticator = TwitterAuthenticator()

    def stream_tweets(self, fetched_tweets_filename, hash_tag_list):
        # This handles Twitter authentication and the connection to the Twitter Streaming API.
        listener = TwitterListener(fetched_tweets_filename)
        auth = self.twitter_authenticator.authenticate_twitter_app()
        stream = Stream(auth, listener)

        # This line filters the Twitter stream to capture data matching the keywords:
        stream.filter(track=hash_tag_list)


class TwitterListener(StreamListener):
    def __init__(self, fetched_tweets_filename):
        super().__init__()
        self.fetched_tweets_filename = fetched_tweets_filename

    # Define how to deal with incoming data.
    def on_data(self, data):
        try:
            print(data)
            with open(self.fetched_tweets_filename, 'a') as tf:
                tf.write(data)
        except Exception as e:
            # Catch Exception rather than BaseException so Ctrl+C still stops the stream.
            print("Error on_data: %s" % str(e))
        return True

    # Override to deal with errors.
    def on_error(self, status):
        if status == 420:
            # Returning False disconnects the stream when Twitter signals rate limiting.
            return False
        print(status)
        return True
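
Because the listener appends each raw message to a file, the stored tweets need to be parsed again before indexing. A minimal sketch of a reader follows, assuming the file ends up with one JSON object per line (the streaming API delimits messages with newlines). The function name `load_tweets` is hypothetical.

```python
import json
from typing import Iterator


def load_tweets(path: str) -> Iterator[dict]:
    """Yield one decoded tweet per non-empty line of a JSON-lines file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank keep-alive lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # skip records cut off by an interrupted write
```

Reading lazily with a generator avoids loading a multi-gigabyte file into memory at once.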
        

# Algorithm for extracting data 

![title](img/web_crawler.jpg)


In [30]:
hash_tag_list = ["elon musk", "donald trump"]
fetch_tweets_filename = "elon_tweets.json"

twitter_client = TwitterClient('elonmusk')

print(twitter_client.get_user_timeline_tweets(1))

# twitter_streamer = TwitterStreamer()
# twitter_streamer.stream_tweets(fetch_tweets_filename, hash_tag_list)


[Status(_api=<tweepy.api.API object at 0x7fd68b91a2b0>, _json={'created_at': 'Thu Jan 31 22:13:46 +0000 2019', 'id': 1091097233821323264, 'id_str': '1091097233821323264', 'text': '@Erdayastronaut @keego73 Absolutely. You’ve touched on a very important point. The ship must be easy to repair on the moon and Mars.', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'Erdayastronaut', 'name': 'Everyday Astronaut', 'id': 3167257102, 'id_str': '3167257102', 'indices': [0, 15]}, {'screen_name': 'keego73', 'name': 'Matthew Keegan', 'id': 2264130601, 'id_str': '2264130601', 'indices': [16, 24]}], 'urls': []}, 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'in_reply_to_status_id': 1091071464919453696, 'in_reply_to_status_id_str': '1091071464919453696', 'in_reply_to_user_id': 3167257102, 'in_reply_to_user_id_str': '3167257102', 'in_reply_to_screen_name': 'Erdayastronaut', 'user': {'id': 44196397, 'id_s

In [None]:
%lsmagic

In [None]:
%ls

In [None]:
%%time
squares = [n*n for n in range(100000000)]
