### Load options, Groupby, Lambdas and Functions

This notebook covers more advanced ways to manipulate your data with lambda functions and custom functions, as well as ways to summarize your data with `groupby`.

In [1]:
import pandas as pd

**NEW:** 
Most pandas functions accept optional arguments. Below, `read_csv` uses `dtype` to set the data type of specific columns and `parse_dates` to read specific columns as dates.

In [None]:
%%time
tweets = pd.read_csv(
    '../data/ira_tweets_csv_hashed.csv', 
    dtype = {'tweetid': 'str','retweet_tweetid':'str'}, # <-- here you can specify data as strings, floats or integers,
    parse_dates = ['account_creation_date', 'tweet_time'] # <-- this line makes pandas interpret these columns as dates   
)
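
To see what these two options change, here is a minimal sketch on a tiny in-memory CSV. The CSV text and the variable names `csv_text`, `default` and `typed` are made up for illustration; only the column names come from the real file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real tweets file.
csv_text = """tweetid,retweet_tweetid,tweet_time
0012345678901234567,,2016-03-01 12:30
0012345678901234568,0012345678901234567,2017-07-15 08:05
"""

# Default parsing: IDs are read as numbers, dates stay as plain text.
default = pd.read_csv(io.StringIO(csv_text))

# With options: IDs stay as strings (leading zeros kept),
# and tweet_time becomes a datetime column.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'tweetid': 'str', 'retweet_tweetid': 'str'},
    parse_dates=['tweet_time'],
)

print(default.dtypes)
print(typed.dtypes)
```

Reading IDs as strings keeps them exactly as written, and a parsed date column is what makes accessors such as `.dt.year` available.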

In [None]:
len(tweets)

In [None]:
tweets.dtypes

In [None]:
tweets.head().T

In [None]:
%%time
# Keep only tweets sent in 2016 or later
tweets_2016_2018 = tweets[
    tweets['tweet_time'].dt.year > 2015
]

In [None]:
print(len(tweets_2016_2018))
tweets_2016_2018.head()

In [None]:
# Count tweets for each (account language, reported location) pair
grouped_tweets = tweets_2016_2018.groupby(['account_language','user_reported_location']).agg({'tweetid': 'count'})

In [None]:
grouped_tweets

In [None]:
grouped_tweets.reset_index()

In [None]:
grouped_tweets.reset_index().sort_values(by = 'tweetid', ascending=False)
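
The group, count, reset and sort steps above can be sketched on a small made-up table. The `toy` data frame below is hypothetical; it only reuses the column names from the tweets data.

```python
import pandas as pd

# Hypothetical stand-in for tweets_2016_2018.
toy = pd.DataFrame({
    'tweetid': ['1', '2', '3', '4', '5'],
    'account_language': ['en', 'ru', 'ru', 'ru', 'ru'],
    'user_reported_location': ['USA', 'Moscow', 'Moscow', 'Moscow', None],
})

counts = (
    toy.groupby(['account_language', 'user_reported_location'])
       .agg({'tweetid': 'count'})   # number of tweets in each group
       .reset_index()               # turn the group keys back into columns
       .sort_values(by='tweetid', ascending=False)  # largest group first
)
print(counts)
```

Note that the row with a missing location does not appear in any group, because `groupby` drops missing keys by default. The counts here add up to 4, not 5.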

### Lambdas
Lambdas are small, unnamed functions written as a single expression. Think of them as mini-functions. Used together with the `.apply()` method, a lambda is run on every value in a column.
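
As a minimal sketch of that idea, the example below applies a lambda to a short, made-up Series. The values in `counts` and `total` are hypothetical.

```python
import pandas as pd

counts = pd.Series([2, 1, 1])  # hypothetical per-group tweet counts
total = 4                      # hypothetical total number of tweets

# .apply() calls the lambda once per value; x is one value from the column.
pct = counts.apply(lambda x: x / total * 100)
print(pct.tolist())  # [50.0, 25.0, 25.0]
```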

In [None]:
num_tweets_per_country_language = grouped_tweets.reset_index().sort_values(by = 'tweetid', ascending=False)

In [None]:
num_tweets_per_country_language['tweetid'].apply(lambda x: x/len(tweets_2016_2018) *100)

In [None]:
num_tweets_per_country_language['percent_of_all_tweets'] = num_tweets_per_country_language['tweetid'].apply(lambda x: (x/len(tweets_2016_2018)) *100)

In [None]:
num_tweets_per_country_language.head()

Here's how you do the same thing with a regular, named function:

In [None]:
def calculate_pct(x):
    return (x/len(tweets_2016_2018)) * 100

In [None]:
num_tweets_per_country_language['percent_of_all_tweets2'] = num_tweets_per_country_language['tweetid'].apply(calculate_pct)

In [None]:
num_tweets_per_country_language.head()
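
One possible variation, shown here only as a sketch: instead of reading the total from a variable defined elsewhere in the notebook, the function can take it as a parameter. The name `calculate_pct_of` and the sample numbers are hypothetical.

```python
import pandas as pd

counts = pd.Series([2, 1, 1])  # hypothetical per-group tweet counts

def calculate_pct_of(x, total):
    """Return x as a percentage of total."""
    return (x / total) * 100

# Extra keyword arguments given to .apply() are forwarded to the function.
pct = counts.apply(calculate_pct_of, total=counts.sum())
print(pct.tolist())  # [50.0, 25.0, 25.0]
```

Passing the total explicitly makes the function reusable with other data frames, at the cost of one more argument.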