# Text Analytics Lecture - Week 3

Hello and welcome to the first hands-on Python lecture in our course on text analytics. Today we will cover the fundamentals of Python for processing text. *Enjoy!*

## Downloading Data

First, we need to load our Trump tweet data into our session. The file comes in .json format, so we use the `json` package. We downloaded the data from [Trump Twitter Archive](http://www.trumptwitterarchive.com/). It is a fun data source and provides some great statistics that we will try to reproduce in this class.

In [2]:
import json

We open the file connection and load the data into the `trump_dict_list` variable.

In [3]:
with open('data/trump_tweets_small.json', encoding="utf8") as f:
    trump_dict_list = json.loads(f.read())
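As a side note, `json.load()` reads directly from an open file object, so the intermediate `f.read()` is optional. The sketch below assumes a tiny made-up file (the two sample records and the temporary path are illustrative, not taken from the archive) and loads it both ways:

```python
import json
import os
import tempfile

# Hypothetical sample records; the real file holds the archive's tweets.
sample = [
    {"id_str": "1", "text": "first sample tweet", "retweet_count": 3},
    {"id_str": "2", "text": "second sample tweet", "retweet_count": 5},
]

# Write the sample to a temporary .json file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample_tweets.json")
with open(path, "w", encoding="utf8") as f:
    json.dump(sample, f)

# Variant used in the lecture: read the text, then parse the string.
with open(path, encoding="utf8") as f:
    from_string = json.loads(f.read())

# Equivalent variant: let json.load() read from the file object directly.
with open(path, encoding="utf8") as f:
    from_file = json.load(f)

print(from_string == from_file)  # True
```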

## Data Structures

Next, we investigate the structure of our data. A great function for this purpose is `type()`:

In [4]:
type(trump_dict_list)

list

Turns out that our object is a `list`. From the textbook, we know that
>*Lists are collections of arbitrary heterogeneous [...] objects. Lists also follow a sequence based on the order in which the objects are present in the list, and each object has its own index with which it can be accessed.*

To sum it up: a `list` is an **indexed** and **heterogeneous** data type.

But how long is our list? We can find out the length of `trump_dict_list` using `len()`:

In [6]:
len(trump_dict_list)

2593

In [9]:
trump_dict_list[3:7:2]

[{'created_at': 'Sat Dec 30 19:02:53 +0000 2017',
  'favorite_count': 78932,
  'id_str': '947181212468203520',
  'is_retweet': False,
  'retweet_count': 23270,
  'text': 'Oppressive regimes cannot endure forever, and the day will come when the Iranian people will face a choice. The world is watching! https://t.co/kvv1uAqcZ9'},
 {'created_at': 'Sat Dec 30 03:42:58 +0000 2017',
  'favorite_count': 138901,
  'id_str': '946949708915924994',
  'is_retweet': False,
  'retweet_count': 60821,
  'text': 'Many reports of peaceful protests by Iranian citizens fed up with regime’s corruption &amp; its squandering of the nation’s wealth to fund terrorism abroad. Iranian govt should respect their people’s rights, including right to express themselves. The world is watching! #IranProtests'}]
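The slice `[3:7:2]` used above follows the pattern `[start:stop:step]`: it starts at index 3, stops before index 7 and takes every second element, so it returns the elements at index 3 and 5. A minimal sketch on a hypothetical list of letters (a stand-in, not part of the tweet data):

```python
# Hypothetical stand-in for trump_dict_list, so the indices are easy to follow.
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

first = letters[0]             # indexing starts at 0
window = letters[3:7]          # indices 3, 4, 5, 6 (the stop index is excluded)
every_second = letters[3:7:2]  # indices 3 and 5

print(first)         # a
print(window)        # ['d', 'e', 'f', 'g']
print(every_second)  # ['d', 'f']
```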

Now, we want to investigate the type of objects that are contained in `trump_dict_list`. We index to the first element in our `list` and apply the `type()` function. **Please, keep in mind that in Python indexing starts with 0!**

In [10]:
type(trump_dict_list[0])

dict

It seems like the contained object is a `dict`, short for dictionary. From our textbook, we know that

> *Dictionaries [...] are key-value mappings that are unordered and mutable. Dictionaries are indexed using keys, which can be any immutable object type, like numeric types or strings [...]. Dictionary values can be immutable or mutable objects,
including lists and dictionaries themselves.*

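To sum it up: a `dict` is a **key-indexed** and **unordered** data type. (Since Python 3.7, a `dict` preserves the insertion order of its keys, but you still access values by key rather than by position.)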

In [12]:
trump_dict_list[0]

{'created_at': 'Sat Dec 30 22:42:09 +0000 2017',
 'favorite_count': 117013,
 'id_str': '947236393184628741',
 'is_retweet': False,
 'retweet_count': 24332,
 'text': 'Jobs are kicking in and companies are coming back to the U.S. Unnecessary regulations and high taxes are being dramatically Cut, and it will only get better. MUCH MORE TO COME!'}

Now let's investigate the `dict` object. For the sake of simplicity, we create a new object `my_trump_dict`, so we don't need to index from `trump_dict_list` anymore.

In [13]:
my_trump_dict = trump_dict_list[1]

In [18]:
my_trump_dict

{'created_at': 'Sat Dec 30 22:36:41 +0000 2017',
 'favorite_count': 195754,
 'id_str': '947235015343202304',
 'is_retweet': False,
 'retweet_count': 50342,
 'text': 'I use Social Media not because I like to, but because it is the only way to fight a VERY dishonest and unfair “press,” now often referred to as Fake News Media. Phony and non-existent “sources” are being used more often than ever. Many stories &amp; reports a pure fiction!'}

In [20]:
my_trump_dict.get('text')

'I use Social Media not because I like to, but because it is the only way to fight a VERY dishonest and unfair “press,” now often referred to as Fake News Media. Phony and non-existent “sources” are being used more often than ever. Many stories &amp; reports a pure fiction!'

Useful methods when working with a `dict` are `.keys()` and `.values()`.

In [21]:
my_trump_dict.keys()

dict_keys(['text', 'created_at', 'retweet_count', 'favorite_count', 'is_retweet', 'id_str'])

The `.keys()` method shows us that the dictionary contains the text, the creation date, the retweet count, the favorite count, a flag indicating whether it is a retweet, and the id of the tweet.

In [22]:
my_trump_dict.values()

dict_values(['I use Social Media not because I like to, but because it is the only way to fight a VERY dishonest and unfair “press,” now often referred to as Fake News Media. Phony and non-existent “sources” are being used more often than ever. Many stories &amp; reports a pure fiction!', 'Sat Dec 30 22:36:41 +0000 2017', 50342, 195754, False, '947235015343202304'])

As you can see, the `.values()` function shows us the values that hide behind the keys.

Now we will talk about how to extract a specific value from a `dict`, based on a given key. The method to use here is `.get()`. It takes the key as input, in our case the name of the key as a string:

In [9]:
my_trump_dict.get('text')

'I use Social Media not because I like to, but because it is the only way to fight a VERY dishonest and unfair “press,” now often referred to as Fake News Media. Phony and non-existent “sources” are being used more often than ever. Many stories &amp; reports a pure fiction!'
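`.get()` differs from a square-bracket lookup in one important way: for a missing key, `[]` raises a `KeyError`, whereas `.get()` returns `None` or a default value passed as second argument. A minimal sketch on a hypothetical dictionary (the key `'lang'` is made up for illustration and does not exist in the tweet data):

```python
# Hypothetical tweet dictionary with only two keys.
sample_tweet = {'text': 'A hypothetical tweet', 'retweet_count': 7}

# For existing keys, both lookups return the same value.
print(sample_tweet['text'])      # A hypothetical tweet
print(sample_tweet.get('text'))  # A hypothetical tweet

# For a missing key, .get() returns None or the given default.
print(sample_tweet.get('lang'))        # None
print(sample_tweet.get('lang', 'en'))  # en

# The square-bracket lookup raises a KeyError instead.
try:
    sample_tweet['lang']
except KeyError as error:
    missing_key = error.args[0]
    print('Missing key:', missing_key)  # Missing key: lang
```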

## Controlling Code Flow

In this section, we will talk about how to control the flow of your program. In particular, we will look at:  
1. Conditional constructs
2. Looping constructs

### Conditional Constructs

This chapter deals with expressing logical statements in Python. If you are not familiar with basic logic or need a refresher, we recommend reading *Representation of Semantics* in our textbook.

#### Logical Expressions

Let's start with the basics: `A == B` tests if `A` equals `B`. If it does, it returns `True`. Otherwise it returns `False`.

In [23]:
my_trump_dict == trump_dict_list[0]

False

In [24]:
my_trump_dict == trump_dict_list[1]

True

Next, `A or B` tests if `A` or `B` are `True`. If at least one item is `True`, then the expression is `True` as well.

In [25]:
my_trump_dict == trump_dict_list[0] or my_trump_dict == trump_dict_list[1]

True

Similarly, `A and B` tests if `A` and `B` are `True`. Only if all items are `True` the expression will evaluate to `True` as well.

In [26]:
my_trump_dict == trump_dict_list[0] and my_trump_dict == trump_dict_list[1]

False

`not A` flips the logical value of `A`. When `A` is `True`, it returns `False` and vice versa.

In [14]:
not my_trump_dict == trump_dict_list[0]

True

A great logical operator in Python is `A in B`. For a `list`, it returns `True` if `A` is equal to at least one element of `B`; for a string, it returns `True` if `A` occurs as a substring of `B`.

In [15]:
my_trump_dict in trump_dict_list

True

In [27]:
"Fake News" in my_trump_dict.get('text')

True

When it comes to Python, there are many tricks for logical expressions that can come in handy, such as:

In [17]:
not list()

True

If you want to know more about this topic, you can find a great resource [here](http://thomas-cokelaer.info/tutorials/python/boolean.html)
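The `not list()` trick works because Python treats empty containers, empty strings, `0` and `None` as false in a logical context, while non-empty containers and strings count as true. A short sketch (the `matches` list is a hypothetical search result):

```python
# bool() makes the truth value of an object explicit.
print(bool([]))           # False: empty list
print(bool(''))           # False: empty string
print(bool({}))           # False: empty dict
print(bool(0))            # False
print(bool(None))         # False
print(bool(['tweet']))    # True: non-empty list
print(bool('Fake News'))  # True: non-empty string

# This is why an empty result list can be tested directly in a condition.
matches = []  # hypothetical search result
if not matches:
    message = 'No matching tweets found.'
else:
    message = '{} matching tweets found.'.format(len(matches))
print(message)  # No matching tweets found.
```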

#### If Statements

`if` statements are the backbone of control flow in Python (and basically any other language). The basic syntax is:  
```
if A:
    b()
else:
    c()
```
Where `A` is a logical value and `b()` and `c()` are arbitrary functions. If `A` is `True`, then `b()` will be executed. Otherwise, `c()` will be executed. Keep in mind that the `else` part of an `if` statement is optional.

In [28]:
if my_trump_dict == trump_dict_list[0]:
    print("That is our favorite tweet!")
else:
    print("That is not what we are looking for...")

That is not what we are looking for...
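When more than two cases need to be distinguished, Python also offers `elif` (short for "else if"), which goes between `if` and `else`. The conditions are checked from top to bottom and only the first matching branch runs. A minimal sketch; the thresholds and labels are assumptions chosen for illustration:

```python
retweet_count = 24332  # retweet count of the first tweet shown above

if retweet_count > 50000:
    label = 'very popular'
elif retweet_count > 10000:
    # Only checked when the first condition was False.
    label = 'popular'
else:
    label = 'not very popular'

print(label)  # popular
```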


### Exercise

**Question #1**: *Write the following code flow.*

*If `my_trump_dict` contains the term "Fake News" and the number of retweets is greater than 50,000, print "The tweet is about 'Fake News' and was very popular". If `my_trump_dict` contains the term "Fake News" and the number of retweets is smaller than or equal to 50,000, print "The tweet is about 'Fake News' but was not very popular". Otherwise, print "The tweet was not about 'Fake News'."*

In [29]:
my_trump_dict

{'created_at': 'Sat Dec 30 22:36:41 +0000 2017',
 'favorite_count': 195754,
 'id_str': '947235015343202304',
 'is_retweet': False,
 'retweet_count': 50342,
 'text': 'I use Social Media not because I like to, but because it is the only way to fight a VERY dishonest and unfair “press,” now often referred to as Fake News Media. Phony and non-existent “sources” are being used more often than ever. Many stories &amp; reports a pure fiction!'}

In [31]:
# Type your solution here:
if 'Fake News' in my_trump_dict.get('text'):
    if my_trump_dict.get('retweet_count') > 50000:
        print('A')
    else:
        print('B')
else:
    print('C')

A


### Looping constructs

Now we will deal with another fundamental part of Python: looping operators. In this section, we will look at:  
1. The `for` loop
2. The `break` and `continue` statement

#### `for` Loops

`for` loops are the most basic looping mechanism in Python. Every `for` loop is built like this:  
```
for A in B:
    c()
```
Where `A` is a name that refers to each element of `B` in turn and can be accessed from inside the loop, and `c()` is an arbitrary function.

`for` loops are very convenient for iterating through a `list`:

In [32]:
clinton_counter = 0
for trump_dict in trump_dict_list:
    if 'Clinton' in trump_dict.get('text'):
        clinton_counter += 1
print('Trump tweeted about the Clintons {} times.'.format(clinton_counter))

Trump tweeted about the Clintons 55 times.


#### `continue` and `break` statements

You do not always have to go through all iterations of a loop. `continue` skips the rest of the current iteration, and `break` quits the loop entirely, which can make your code much faster.

In [33]:
clinton_counter = 0
for trump_dict in trump_dict_list:
    if 'Hillary' in trump_dict.get('text'):
        continue
    if 'Clinton' in trump_dict.get('text'):
        clinton_counter += 1
print('Trump tweeted about the Clintons {} times.'.format(clinton_counter))

Trump tweeted about the Clintons 26 times.
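The cell above only uses `continue`. A minimal sketch of `break`, run on a short hypothetical list of texts rather than the real data: we count how many texts are inspected before the first one that mentions 'Jobs', then leave the loop.

```python
# Hypothetical texts; in the notebook you would iterate over trump_dict_list.
sample_texts = [
    'Great meeting today',
    'Taxes will be cut',
    'Jobs are coming back',
    'More to come',
]

checked = 0
for text in sample_texts:
    if 'Jobs' in text:
        break  # leave the loop entirely; later texts are never inspected
    checked += 1

print('Texts inspected before the first match: {}'.format(checked))  # 2
```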


### Exercise

**Question #1:** *Build the following looping construct.*  

*Count the number of times that Trump uses the word 'China'.*  
**Note**:  *Do not count the number of tweets that contain the word 'China' (the word might occur multiple times in one tweet).*

In [36]:
# Type your solution here
china_counter = 0
for trump_dict in trump_dict_list:
    for word in trump_dict.get('text').split():
        if word == 'China':
            china_counter += 1
print(china_counter)

41


In [35]:
assert china_counter == 41, "The solution should be 41, but your solution is {}.".format(china_counter)

**Question #2:** *Build the following looping construct.*  

*How many times did Trump tweet about Barack 'Obama' before talking about Bernie 'Sanders'?*

In [24]:
# Type your solution here
obama_count = 0

In [25]:
assert obama_count == 15, "The solution should be 15 but your solution is {}".format(obama_count)

AssertionError: The solution should be 15 but your solution is 0

## Functional Programming

We are starting to enter the more sophisticated area of Python by introducing functional programming. In this chapter, we will begin to understand:  
1. Functions
2. Comprehensions  

**Note**: We will slightly digress from the textbook here, since we will not cover the following parts:  
* Recursive functions  
* Anonymous functions  
* Iterators  
* Generators  

You can find more information about these topics in the chapter *'Functional Programming'*.


### Functions

This is what our textbook has to say about functions:  
>*A function can be defined as a block of code that is executed only on request by invoking
it. Functions consist of a function definition that has the function signature (function
name, parameters) and a group of statements inside the function that are executed when
the function is called.*  

In general, this is what a function looks like:  
```
def function(params): # params are the input parameters
    <code block> # code block consists of a group of statements
    return value(s) # optional return statement
```

For example, we can build a simple function that checks whether the tweet contains the search term 'Fake News':

In [38]:
def tweet_contains_fake_news(tweet_dict):
    """
    Checks whether tweet contains 'Fake News'.
    
    Takes the 'text' element from the dictionary and matches the pattern 'Fake News' against it.
    
    Args:
        tweet_dict: A dictionary that contains tweet data.
    Returns:
        A boolean value, corresponding to whether the tweet contains 'Fake News'.
    """
    
    contains_fake_news =  'Fake News' in tweet_dict.get('text')
    return contains_fake_news

Now that we have defined the function `tweet_contains_fake_news`, we can test its behavior.

In [41]:
tweet_contains_fake_news(trump_dict_list[0])

False

Knowing that this tweet does not contain the search term *'Fake News'*, the returned output seems correct.

### Exercise

**Question #1:** *Build the `count_words_in_tweet()` function.*

In [49]:
# Type your solution here

def count_words_in_tweet(tweet_dict):
    """
    Counts the words in a tweet.
    
    Retrieves the text from the tweet, splits it on whitespace and counts
    the length of the resulting list.
    
    Args:
        tweet_dict: A dictionary that contains tweet data.
    
    Returns:
        An integer value, corresponding to the number of words in the tweet.
    """
    word_count = len(tweet_dict.get('text').split())
    return word_count

In [50]:
print(count_words_in_tweet(my_trump_dict))

50


Again let's test our function `count_words_in_tweet`:

In [51]:
assert count_words_in_tweet(my_trump_dict) == 50, "The solution should be 50 but your solution is {}.".format(
    count_words_in_tweet(my_trump_dict))

In [30]:
my_trump_dict.get('text')

'I use Social Media not because I like to, but because it is the only way to fight a VERY dishonest and unfair “press,” now often referred to as Fake News Media. Phony and non-existent “sources” are being used more often than ever. Many stories &amp; reports a pure fiction!'

The returned value (50) seems to align with the number of words in the tweet. That's good!

**Question #2:** *On average, how many words does Trump use per tweet?*  
**Hint:** Use the `count_words_in_tweet()` function.

In [53]:
# Solution as a loop:

      

word_avg = 0

IndentationError: expected an indented block (<ipython-input-53-58ec050afd63>, line 7)

In [32]:
assert round(word_avg) == 21, "The solution should be 21 but your solution is {}.".format(round(word_avg))

AssertionError: The solution should be 21 but your solution is 0.

The solution is perfectly valid. However, here is another implementation, using the `statistics` package:

In [33]:
import statistics

In [34]:
# Type your solution here
pass
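One possible way to fill in the two solution cells, sketched on three hypothetical tweets instead of the full data set (so the numbers are not the course result). The names `sample_dict_list` and `word_avg_stats` are assumptions introduced for this example:

```python
import statistics


def count_words_in_tweet(tweet_dict):
    """Counts the whitespace-separated words of a tweet (as defined above)."""
    return len(tweet_dict.get('text').split())


# Hypothetical sample; in the notebook you would use trump_dict_list instead.
sample_dict_list = [
    {'text': 'one two three'},
    {'text': 'one two three four five'},
    {'text': 'one'},
]

# Solution as a loop: sum the word counts, then divide by the number of tweets.
total_words = 0
for tweet_dict in sample_dict_list:
    total_words += count_words_in_tweet(tweet_dict)
word_avg = total_words / len(sample_dict_list)

# Solution with the statistics package: collect the counts, then take the mean.
word_counts = list()
for tweet_dict in sample_dict_list:
    word_counts.append(count_words_in_tweet(tweet_dict))
word_avg_stats = statistics.mean(word_counts)

print(word_avg == word_avg_stats)  # True
```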

The results are equivalent - *a good sign!*

### Comprehensions

Comprehensions are a very *pythonic* way of programming. They allow you to write `for` loops that build a list in a more compact way.  
```
# typical comprehension syntax
result = [ expression for item in iterable ]

# equivalent for loop statement
result = []
for item in iterable:
    result.append(expression)
```

This is a nice example for the beauty of comprehensions:

We can count all tweets that contain the word 'Mexico' in just one line of code.  
In comparison, it takes 5 lines of code, using a `for` loop.

In [35]:
mexico_tweets = list()
for trump_dict in trump_dict_list:
    if 'Mexico' in trump_dict.get('text'):
        mexico_tweets.append(trump_dict)
len(mexico_tweets)

18

In [52]:
len([trump_dict for trump_dict in trump_dict_list if 'Mexico' in trump_dict.get('text')])

18

### Exercise

**Question #1:** *Calculate the mean number of words in all tweets that are longer than 50 characters.*  
**Note:** Try using comprehensions.

In [57]:
import statistics

In [60]:
# Solution as a loop.

word_list = list()

for trump_dict in trump_dict_list:
    char_len = len(trump_dict.get('text'))
    word_count = len(trump_dict.get('text').split())    
    if char_len > 50:
        word_list.append(word_count)

word_avg = statistics.mean(word_list)
print(word_avg)

21.765820233776704


In [64]:
# Solution as comprehension

statistics.mean([len(trump_dict.get('text').split()) for trump_dict in trump_dict_list if len(trump_dict.get('text')) > 50])

21.765820233776704


In [38]:
assert round(long_tweet_avg) == 53, "The solution should be 53 but your solution is {}.".format(round(long_tweet_avg))

AssertionError: The solution should be 53 but your solution is 0.

## Classes

Python incorporates many principles of *object-oriented-programming* (OOP), such as classes. In a nutshell, classes are a model of a real-world entity. Anything can be a class: a *'car'*, a *'neural network'* or a *'tweet'*. If you are new to the world of OOP, this [video](https://www.youtube.com/watch?v=lbXsrHGhBAU) will provide you with a great general introduction to the topic. In our course, we will learn about classes in a learning-by-doing approach.  

In the first step, we will design a class that represents a tweet in our data set.

In [66]:
class Tweet(object):
    """Class represents tweet from Twitter.

    Attributes:
        tweet_id: A unique identifier.
        created_at: A date-time, describing when the tweet was sent.
        text: A string, containing the message of the tweet.
        is_retweet: A boolean, stating whether it is a retweet.
        retweet_count: An integer, stating the number of retweets.
        favorite_count: An integer, stating the number of favorites.
    """

    def __init__(self, tweet_id, created_at, text, is_retweet, retweet_count, favorite_count):
        """Initializes Tweet class with defined content."""
        self.tweet_id = tweet_id
        self.created_at = created_at
        self.text = text
        self.is_retweet = is_retweet
        self.retweet_count = retweet_count
        self.favorite_count = favorite_count

### Introduction to `__init__()`

The `__init__()` function is what we call the *constructor*. It takes a set of inputs and initializes a new object of class `Tweet`. To distinguish between the input parameters and the attributes stored on the object, we prefix the latter with `self.`. Keep in mind that you do not call the `__init__()` function directly. Instead, we invoke the constructor through the name of the class, in our case `Tweet()`:

In [68]:
my_trump_tweet = Tweet(my_trump_dict.get('id_str'), my_trump_dict.get('created_at'),
                       my_trump_dict.get('text'), my_trump_dict.get('is_retweet'), 
                       my_trump_dict.get('retweet_count'), my_trump_dict.get('favorite_count'))
type(my_trump_tweet)

__main__.Tweet

In [70]:
my_trump_tweet.tweet_id

'947235015343202304'

### Exercise

You can see that an object of class `Tweet` has been created. However, the construction of the object was quite tedious. Let's build a static function `create_tweet_from_dict()` that takes in a `dict` with tweet information and constructs a `Tweet` object:

In [71]:
# Type your solution here

def create_tweet_from_dict(tweet_dict):
    """
    Creates a tweet object from dictionary.
    
    Extracts tweet_id, created_at, text, is_retweet,
    retweet_count and favorite_count from dictionary.
    
    Args:
        tweet_dict: A dictionary, containing tweet information.
        
    Returns:
        A tweet object.
    """
    # Extract parameters from dictionary
    tweet_id = tweet_dict.get('id_str')
    created_at = tweet_dict.get('created_at')
    text = tweet_dict.get('text')
    is_retweet = tweet_dict.get('is_retweet')
    retweet_count = tweet_dict.get('retweet_count')
    favorite_count = tweet_dict.get('favorite_count')
    
    # Create tweet object
    tweet = Tweet(tweet_id, created_at, text, is_retweet, retweet_count, favorite_count)
    
    return tweet

Now, let's test our new `create_tweet_from_dict()` function:

In [73]:
my_trump_tweet = create_tweet_from_dict(my_trump_dict)

In [75]:
type(my_trump_tweet)

__main__.Tweet

In [43]:
assert type(my_trump_tweet) == Tweet, "Your object is not of type Tweet but {}.".format(
type(my_trump_tweet))

In [44]:
assert my_trump_tweet.text == my_trump_dict.get('text'), "The content of the two elements does not match."

This creates an equal object in a much more convenient way. *Well done!*  
We can now create a list of objects of class `Tweet` that we will call `trump_tweet_list`:

In [76]:
trump_tweet_list = list()
for trump_dict in trump_dict_list:
    trump_tweet = create_tweet_from_dict(trump_dict)
    trump_tweet_list.append(trump_tweet)
len(trump_tweet_list)

2593
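As an aside, a factory function like `create_tweet_from_dict()` can also be attached to the class itself with the `@classmethod` decorator, so that callers write `Tweet.from_dict(...)`. This is a sketch of an alternative design, not the approach used in the rest of this lecture; the method name `from_dict` and the sample record are assumptions:

```python
class Tweet(object):
    """Compact copy of the Tweet class from above, extended by a factory method."""

    def __init__(self, tweet_id, created_at, text, is_retweet, retweet_count, favorite_count):
        self.tweet_id = tweet_id
        self.created_at = created_at
        self.text = text
        self.is_retweet = is_retweet
        self.retweet_count = retweet_count
        self.favorite_count = favorite_count

    @classmethod
    def from_dict(cls, tweet_dict):
        """Builds a Tweet from a dictionary that uses the archive's key names."""
        return cls(tweet_dict.get('id_str'), tweet_dict.get('created_at'),
                   tweet_dict.get('text'), tweet_dict.get('is_retweet'),
                   tweet_dict.get('retweet_count'), tweet_dict.get('favorite_count'))


# Hypothetical record with the same keys as the tweet dictionaries.
sample_dict = {'id_str': '1', 'created_at': 'Mon Jan 01 00:00:00 +0000 2018',
               'text': 'A hypothetical tweet', 'is_retweet': False,
               'retweet_count': 3, 'favorite_count': 4}

sample_tweet = Tweet.from_dict(sample_dict)
print(type(sample_tweet).__name__)  # Tweet
print(sample_tweet.text)            # A hypothetical tweet
```

Running this inside the notebook would replace the `Tweet` class defined earlier, so treat it as a standalone sketch.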

### Introduction to `__str__()`

In [77]:
print(my_trump_dict)

{'text': 'I use Social Media not because I like to, but because it is the only way to fight a VERY dishonest and unfair “press,” now often referred to as Fake News Media. Phony and non-existent “sources” are being used more often than ever. Many stories &amp; reports a pure fiction!', 'created_at': 'Sat Dec 30 22:36:41 +0000 2017', 'retweet_count': 50342, 'favorite_count': 195754, 'is_retweet': False, 'id_str': '947235015343202304'}


Next, we want to build a function that returns an eye-friendly representation of our `Tweet` object. What about something like this:

In [78]:
def string_tweet(tweet):
    return "Tweet id: {}\nCreated at: {}\nIs retweet: {}\nRetweet count: {}\nFavorite count: {}\nText:\n{}".format(
        tweet.tweet_id, tweet.created_at, tweet.is_retweet,
        tweet.retweet_count, tweet.favorite_count, tweet.text)

In [79]:
print(string_tweet(my_trump_tweet))

Tweet id: 947235015343202304
Created at: Sat Dec 30 22:36:41 +0000 2017
Is retweet: False
Retweet count: 50342
Favorite count: 195754
Text:
I use Social Media not because I like to, but because it is the only way to fight a VERY dishonest and unfair “press,” now often referred to as Fake News Media. Phony and non-existent “sources” are being used more often than ever. Many stories &amp; reports a pure fiction!


What a nice representation for our tweet!  

We can now integrate the `string_tweet()` function into our `Tweet` class. The trick is to override the `__str__()` function, which returns a user-friendly representation of the object. Every class in Python comes innately with a `__str__()` function.

In [80]:
Tweet.__str__ = string_tweet

The `__str__()` function is especially useful, since it is called by other functions, such as `print()`:

In [81]:
print(my_trump_tweet)

Tweet id: 947235015343202304
Created at: Sat Dec 30 22:36:41 +0000 2017
Is retweet: False
Retweet count: 50342
Favorite count: 195754
Text:
I use Social Media not because I like to, but because it is the only way to fight a VERY dishonest and unfair “press,” now often referred to as Fake News Media. Phony and non-existent “sources” are being used more often than ever. Many stories &amp; reports a pure fiction!


### Introduction to `__eq__()`

Next, we want to determine whether two Tweets are identical. For this purpose, let's create a copy of `my_trump_tweet`, called `my_trump_tweet_copy`:

In [82]:
my_trump_tweet_copy = create_tweet_from_dict(my_trump_dict)

You can see that the two objects are equal, when comparing the `tweet_id`.

In [83]:
print(my_trump_tweet.tweet_id)
print(my_trump_tweet_copy.tweet_id)

947235015343202304
947235015343202304


However, checking for equality returns `False`. Why is that?

In [84]:
my_trump_tweet == my_trump_tweet_copy

False

It turns out that although the content of the objects is *equal*, they are not *identical*. By default, `==` on objects of a custom class only checks whether both sides are the very same object, and here we have two separate objects at different memory addresses:

In [53]:
my_trump_tweet

<__main__.Tweet at 0x2ab434aa518>

In [54]:
my_trump_tweet_copy

<__main__.Tweet at 0x2ab4352b518>

So how do we define equality for two objects of class `Tweet`? The simplest approach is to compare the unique `tweet_id`. If the `tweet_id` is equal, then the two objects are equal as well.

In [85]:
def equals_tweet(first_tweet, second_tweet):
    return first_tweet.tweet_id == second_tweet.tweet_id

We can see that our new function `equals_tweet()` returns `True`. A good result!

In [86]:
equals_tweet(my_trump_tweet, my_trump_tweet_copy)

True

Now, we can integrate the `equals_tweet()` function into our `Tweet` class by overriding the `__eq__()` function, which tests two objects of the class for equality. Again, `__eq__()` comes built in with all classes in Python.

In [87]:
Tweet.__eq__ = equals_tweet

When we check for equality now, the system returns `True`. *Isn't that great?*

In [88]:
my_trump_tweet == my_trump_tweet_copy

True

### Introduction to `__hash__()`

Currently, we are storing our tweets in a `list`, which comes with one big flaw: a `list` can contain duplicates of the same object. Therefore, we can add the same tweet over and over again:

In [89]:
len(trump_tweet_list)

2593

In [90]:
trump_tweet_list.append(my_trump_tweet)
len(trump_tweet_list)

2594

Fortunately, there is a data type perfectly tailored for our purpose: a `set`. From our textbook, we know that  
>*Sets are unordered collections of unique and immutable objects [...]. Sets are typically used to remove
duplicates from a list, test memberships, and perform mathematical set operations,
including union, intersection, difference, and symmetric difference.*  

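To sum it up: a `set` is **unordered** and contains only **unique** elements. To decide whether two elements are duplicates, a `set` relies on `__hash__()` together with `__eq__()`, so we derive the hash of a `Tweet` from its `tweet_id` as well: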

In [91]:
def hash_tweet(tweet):
    return hash(tweet.tweet_id)

In [92]:
Tweet.__hash__ = hash_tweet

In [99]:
trump_tweet_set = set(trump_tweet_list)
len(trump_tweet_set)

2593

The set has already automatically removed the duplicate!
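In this lecture we attach `__str__()`, `__eq__()` and `__hash__()` to the class after it has been defined. A more common style is to define them directly inside the class body. The following is a minimal sketch under that assumption, using a reduced, hypothetical class `SimpleTweet` with only two attributes:

```python
class SimpleTweet(object):
    """Hypothetical, reduced version of the Tweet class."""

    def __init__(self, tweet_id, text):
        self.tweet_id = tweet_id
        self.text = text

    def __str__(self):
        # Used by print() and str().
        return "Tweet id: {}\nText:\n{}".format(self.tweet_id, self.text)

    def __eq__(self, other):
        # Two tweets are equal if their ids match.
        if not isinstance(other, SimpleTweet):
            return NotImplemented
        return self.tweet_id == other.tweet_id

    def __hash__(self):
        # A class that defines __eq__ in its body must also define __hash__
        # to stay usable in sets and as a dictionary key.
        return hash(self.tweet_id)


first = SimpleTweet('1', 'hello')
copy_of_first = SimpleTweet('1', 'hello')
second = SimpleTweet('2', 'world')

print(first == copy_of_first)               # True
print(len({first, copy_of_first, second}))  # 2
```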

### Exercises

In [93]:
def calculate_popularity(tweet):
    return tweet.retweet_count + tweet.favorite_count

In [95]:
Tweet.calculate_popularity = calculate_popularity

In [96]:
my_trump_tweet.calculate_popularity()

246096

**Question #1:** Find the most popular Trump tweet.

In [101]:
most_popular_tweet = None
highest_popularity = -1

for trump_tweet in trump_tweet_set:
    tweet_popularity = trump_tweet.calculate_popularity()
    if tweet_popularity > highest_popularity:
        most_popular_tweet = trump_tweet
        highest_popularity = tweet_popularity
        
print(most_popular_tweet)

Tweet id: 881503147168071680
Created at: Sun Jul 02 13:21:42 +0000 2017
Is retweet: False
Retweet count: 369530
Favorite count: 605098
Text:
#FraudNewsCNN #FNN https://t.co/WYUnHjjUjg


In [69]:
most_popular_tweet = None

In [70]:
assert most_popular_tweet.tweet_id == '881503147168071680', "You have not found the most popular tweet."

AttributeError: 'NoneType' object has no attribute 'tweet_id'

You can also solve this in one line, using the `key` parameter of the `max()` function.

In [72]:
# Type your solution here
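A possible one-line solution, sketched on three hypothetical tweets: `max()` calls the function passed as `key` on every element and returns the element with the largest result. The reduced class `SimpleTweet` and its sample values are assumptions; in the notebook you would pass `trump_tweet_set` and `Tweet.calculate_popularity` instead.

```python
class SimpleTweet(object):
    """Hypothetical, reduced version of the Tweet class."""

    def __init__(self, tweet_id, retweet_count, favorite_count):
        self.tweet_id = tweet_id
        self.retweet_count = retweet_count
        self.favorite_count = favorite_count

    def calculate_popularity(self):
        return self.retweet_count + self.favorite_count


sample_tweets = [
    SimpleTweet('1', 10, 20),    # popularity 30
    SimpleTweet('2', 300, 500),  # popularity 800
    SimpleTweet('3', 50, 5),     # popularity 55
]

# The one-liner: the method itself serves as the key function.
most_popular_tweet = max(sample_tweets, key=SimpleTweet.calculate_popularity)
print(most_popular_tweet.tweet_id)  # 2
```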


## Working with Text

Now that we have acquired a basic knowledge of Python, we can start working with the fundamentals of text analytics.

### String Indexing

You can index strings the same way you index lists: using the `[]` operator.

In [64]:
my_trump_tweet.text

'I use Social Media not because I like to, but because it is the only way to fight a VERY dishonest and unfair “press,” now often referred to as Fake News Media. Phony and non-existent “sources” are being used more often than ever. Many stories &amp; reports a pure fiction!'

The syntax of `[a:b:c]` goes as follows:
- `a` is the start index (default: 0)
- `b` is the end index, which is excluded (default: length of string)
- `c` is the step size (default: 1)

In [65]:
my_trump_tweet.text[:18]

'I use Social Media'

In [66]:
my_trump_tweet.text[0] + my_trump_tweet.text[75:110] + my_trump_tweet.text[111:116]

'I fight a VERY dishonest and unfair press'

In [67]:
my_trump_tweet.text[-13:-1]

'pure fiction'

In [68]:
my_trump_tweet.text[-13:-1:2]

'pr ito'

### String Methods

There are many built-in methods for handling strings:

In [69]:
tweet_method = my_trump_tweet.text[0] + my_trump_tweet.text[75:110] + my_trump_tweet.text[111:116]

In [70]:
tweet_method

'I fight a VERY dishonest and unfair press'

In [71]:
tweet_method.lower()

'i fight a very dishonest and unfair press'

In [72]:
tweet_method.upper()

'I FIGHT A VERY DISHONEST AND UNFAIR PRESS'

In [73]:
tweet_method.replace('press', 'banana')

'I fight a VERY dishonest and unfair banana'

In [74]:
tweet_method.split()

['I', 'fight', 'a', 'VERY', 'dishonest', 'and', 'unfair', 'press']

### Regular Expressions

In [76]:
import re

In [79]:
re.search("News", my_trump_tweet.text)

<_sre.SRE_Match object; span=(149, 153), match='News'>
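`re.search()` returns a match object for the first occurrence of the pattern, or `None` if the pattern does not occur. To collect all occurrences, `re.findall()` returns them as a list. A brief sketch on a short hypothetical string (not one of the tweets):

```python
import re

# Hypothetical text, made up for illustration.
sample_text = 'Fake News Media! More Fake News to come. #IranProtests'

# First occurrence: the match object knows its position and the matched text.
match = re.search('News', sample_text)
print(match.span())   # (5, 9)
print(match.group())  # News

# No occurrence: re.search() returns None.
print(re.search('China', sample_text))  # None

# All occurrences: '#' followed by one or more word characters.
hashtags = re.findall(r'#\w+', sample_text)
print(hashtags)  # ['#IranProtests']

# Counting how often a term appears.
news_count = len(re.findall('News', sample_text))
print(news_count)  # 2
```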

### Exercises

In [76]:
# How many Trump tweets end with "!"?
exclamation_point_counter = 0

for trump_tweet in trump_tweet_set:
    if trump_tweet.text[-1] == "!":
        exclamation_point_counter += 1
        
print("{} Trump tweets end with '!'.".format(exclamation_point_counter))

935 Trump tweets end with '!'.


In [108]:
assert exclamation_point_counter == 935, "The solution is 935 but you found {}".format(exclamation_point_counter)

NameError: name 'exclamation_point_counter' is not defined