# Python Data Science Toolbox (Part 2)
Run the hidden code cell below to import the data used in this course.

In [1]:
# Import the course packages
import pandas as pd
import matplotlib.pyplot as plt

# Import the course datasets 
world_ind = pd.read_csv('datasets/world_ind_pop_data.csv')
tweets = pd.read_csv('datasets/tweets.csv')

## Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- Create a `zip` object containing the `CountryName` and `CountryCode` columns in `world_ind`. Unpack the resulting `zip` object and print the tuple values.
- Use a list comprehension to extract the first 25 characters of the `text` column of the `tweets` DataFrame, provided that the tweet is not a retweet (i.e., it does not start with "RT").
- Create an iterable reader object so that you can use `next()` to read `datasets/world_ind_pop_data.csv` in chunks of 20.

In [2]:
world_ind[["CountryName","CountryCode"]]

Unnamed: 0,CountryName,CountryCode
0,Arab World,ARB
1,Caribbean small states,CSS
2,Central Europe and the Baltics,CEB
3,East Asia & Pacific (all income levels),EAS
4,East Asia & Pacific (developing only),EAP
...,...,...
13369,Virgin Islands (U.S.),VIR
13370,West Bank and Gaza,WBG
13371,"Yemen, Rep.",YEM
13372,Zambia,ZMB


In [3]:
# Create a zip object containing the CountryName and CountryCode columns in world_ind
Country_Code = zip(world_ind["CountryName"],world_ind["CountryCode"])
print(type(Country_Code))

<class 'zip'>


In [12]:
# Unpack the resulting zip object and print the tuple values.
for country,code in Country_Code:
    print(country,code)

In [19]:
# Use a list comprehension to extract the first 25 characters of the text column of the tweets DataFrame, provided that the tweet is not a retweet (i.e., it does not start with "RT").

not_retweet = [tweet[:25] for tweet in tweets["text"] if tweet[:2]!="RT"]
print(not_retweet)

['Njihuni me Zonjën Trump !', 'Your an idiot she shouldn', 'Your an idiot she shouldn', '#HillYes #ImWithHer #Roll', "Trump won't do a yes ma'a", '#HillYes #ImWithHer #Roll', 'Opinion: The big story is', 'GOP speechwriter: By Nove', 'This dude must have some ', 'Opinion: The big story is', 'It Cometh from the Pit. A', '@footlooseracer @hautedam', 'PSA: @piersmorgan is a as', 'Me listening to DONALD TR', 'PSA: @piersmorgan is a as', 'Susan Sarandon Shares Int', '@jbrading dude you are an', 'Susan Sarandon Shares Int', '@realDonaldTrump Its too ', 'Photo: #Donald #Trump #Pr', '@jbrading dude you are an', 'Photo: #Donald #Trump #Pr', '@realDonaldTrump @MELANIA', '@realDonaldTrump Its too ', "I just saw this. I'm spee", "I just saw this. I'm spee", 'Trump campaign chief char', '@realDonaldTrump @MELANIA', '@ErinBurnett @Bakari_Sell', '@ErinBurnett @Bakari_Sell', '@noreallyhowcome @TVinebe', 'Trump who prides himself ', 'Judicial Watch: Obama Adm', "I don't understand how an", 'Donald Trump

In [21]:
#Create an iterable reader object so that you can use next() to read datasets/world_ind_pop_data.csv in chunks of 20.

chunk = pd.read_csv("datasets/world_ind_pop_data.csv",chunksize=20)
print(chunk)

print(next(chunk))
print(next(chunk))

<pandas.io.parsers.readers.TextFileReader object at 0x7f2a80b75610>
                                      CountryName  ... Urban population (% of total)
0                                      Arab World  ...                     31.285384
1                          Caribbean small states  ...                     31.597490
2                  Central Europe and the Baltics  ...                     44.507921
3         East Asia & Pacific (all income levels)  ...                     22.471132
4           East Asia & Pacific (developing only)  ...                     16.917679
5                                       Euro area  ...                     62.096947
6       Europe & Central Asia (all income levels)  ...                     55.378977
7         Europe & Central Asia (developing only)  ...                     38.066129
8                                  European Union  ...                     61.212898
9        Fragile and conflict affected situations  ...                     17.8919

## Iterating over iterables: next()

In [2]:
word = "Da"
it = iter(word) # iter() returns an iterator over the iterable

print(next(it)) # next() retrieves the next value from the iterator
print(next(it))

D
a
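
Once an iterator has no values left, a further `next()` call raises `StopIteration`; passing a second argument to `next()` returns that default instead. A minimal sketch of both behaviours (the variable names `exhausted` and `fallback` are illustrative, not part of the course code):

```python
word = "Da"
it = iter(word)  # iter() returns an iterator over the string

first = next(it)   # 'D'
second = next(it)  # 'a'

# The iterator is now exhausted, so another next() raises StopIteration
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

# Supplying a default value avoids the exception
fallback = next(it, None)

print(first, second, exhausted, fallback)
```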


## Iterating at once with *

In [3]:
word = "Data"
it = iter(word)

print(*it) # * unpacks all remaining values from the iterator at once

D a t a
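
Unpacking with `*` consumes the iterator, so a second pass over the same iterator yields nothing. The sketch below illustrates this with list unpacking instead of `print()`; the names `letters` and `leftover` are chosen here for illustration:

```python
word = "Data"
it = iter(word)

letters = [*it]      # unpack every remaining element into a list
leftover = list(it)  # the iterator is exhausted, so nothing is left

print(letters)
print(leftover)
```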


## Using enumerate()

`enumerate()` returns an enumerate object that yields (index, element) pairs: each element of the original iterable together with its index within that iterable.

In [4]:
avengers = ["hawkeye","black widow","thor","hulk","ironman","captain america"]
e = enumerate(avengers)
print(e)
print(type(e))

enumerate_list = list(e)
print(enumerate_list)

<enumerate object at 0x7f790e8501c0>
<class 'enumerate'>
[(0, 'hawkeye'), (1, 'black widow'), (2, 'thor'), (3, 'hulk'), (4, 'ironman'), (5, 'captain america')]
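
By default the index starts at 0, but `enumerate()` also accepts a `start` argument. A short sketch, using a shortened version of the list above:

```python
avengers = ["hawkeye", "black widow", "thor"]

# Begin counting at 1 instead of the default 0
numbered = list(enumerate(avengers, start=1))

print(numbered)
```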


## enumerate() and unpack 

In [5]:
avengers = ["hawkeye","black widow","thor","hulk","ironman","captain america"]
for index, value in enumerate(avengers):
    print(index,value)

0 hawkeye
1 black widow
2 thor
3 hulk
4 ironman
5 captain america


## Using zip()

In [1]:
avengers = ["hawkeye","black widow","thor","hulk","ironman","captain america"]
names = ["barton","scarlett","odin","bruce","stark","steve"]

z = zip(avengers,names)
print(z)
print(type(z))

zip_list = list(z)
print(zip_list)

<zip object at 0x7f3411f25ac0>
<class 'zip'>
[('hawkeye', 'barton'), ('black widow', 'scarlett'), ('thor', 'odin'), ('hulk', 'bruce'), ('ironman', 'stark'), ('captain america', 'steve')]
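
One detail worth keeping in mind: `zip()` stops as soon as the shortest input is exhausted. If the extra items should be kept, `itertools.zip_longest` pads the shorter input with a fill value. A minimal sketch with deliberately unequal lists (shortened from the ones above):

```python
from itertools import zip_longest

avengers = ["hawkeye", "black widow", "thor"]
names = ["barton", "scarlett"]  # one item shorter on purpose

# zip() silently drops 'thor' because names has no third item
short = list(zip(avengers, names))

# zip_longest() keeps it and fills the gap with None
padded = list(zip_longest(avengers, names, fillvalue=None))

print(short)
print(padded)
```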


## zip() and unpack

In [2]:
avengers = ["hawkeye","black widow","thor","hulk","ironman","captain america"]
names = ["barton","scarlett","odin","bruce","stark","steve"]

for avenger,name in zip(avengers,names):
    print(avenger,name)

hawkeye barton
black widow scarlett
thor odin
hulk bruce
ironman stark
captain america steve


## Using * and zip to 'unzip'

- Create a zip object by using zip() on mutants and powers, in that order. Assign the result to z1.
- Print the tuples in z1 by unpacking them into positional arguments using the * operator in a print() call.
- Because the previous print() call would have exhausted the elements in z1, recreate the zip object you defined earlier and assign the result again to z1.
- 'Unzip' the tuples in z1 by unpacking them into positional arguments using the * operator in a zip() call. Assign the results to result1 and result2, in that order.
- The last print() statements print the results of comparing result1 to mutants and result2 to powers, confirming whether the unpacked result1 and result2 are equivalent to mutants and powers, respectively.

In [2]:
mutants = ('charles xavier',
 'bobby drake',
 'kurt wagner',
 'max eisenhardt',
 'kitty pryde')

powers = ('telepathy',
 'thermokinesis',
 'teleportation',
 'magnetokinesis',
 'intangibility')

In [3]:
# Create a zip object from mutants and powers: z1
z1 = zip(mutants,powers)

# Print the tuples in z1 by unpacking with *
print(*z1)

# Re-create a zip object from mutants and powers: z1
z1 = zip(mutants,powers)

# 'Unzip' the tuples in z1 by unpacking with * and zip(): result1, result2
result1, result2 = zip(*z1)

print(result1)
print(result2)

# Check if unpacked tuples are equivalent to original tuples
print(result1 == mutants)
print(result2 == powers)

('charles xavier', 'telepathy') ('bobby drake', 'thermokinesis') ('kurt wagner', 'teleportation') ('max eisenhardt', 'magnetokinesis') ('kitty pryde', 'intangibility')
('charles xavier', 'bobby drake', 'kurt wagner', 'max eisenhardt', 'kitty pryde')
('telepathy', 'thermokinesis', 'teleportation', 'magnetokinesis', 'intangibility')
True
True
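
The comparisons above return `True` because `mutants` and `powers` are tuples, and 'unzipping' with `zip(*...)` always produces tuples. If the inputs were lists, the direct comparison would fail until the result is converted back. A small sketch under that assumption (the two-element lists are illustrative):

```python
mutants = ["charles xavier", "bobby drake"]  # lists this time, not tuples
powers = ["telepathy", "thermokinesis"]

# Zip the lists together, then unzip them again
result1, result2 = zip(*zip(mutants, powers))

same_as_list = result1 == mutants           # a tuple never equals a list
same_after_cast = list(result1) == mutants  # equal once converted

print(type(result1), same_as_list, same_after_cast)
```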


## Loading data in chunks

- Initialize an empty dictionary counts_dict for storing the results of processing the Twitter data.
- Iterate over the 'tweets.csv' file by using a for loop. Use the loop variable chunk and iterate over the call to pd.read_csv() with a chunksize of 10.
- In the inner loop, iterate over the column 'lang' in chunk by using a for loop. Use the loop variable entry.

In [3]:
# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Iterate over the file chunk by chunk
for chunk in pd.read_csv("datasets/tweets.csv",chunksize=10):

    # Iterate over the column in DataFrame
    for entry in chunk["lang"]:
        if entry in counts_dict:
            counts_dict[entry] += 1
        else:
            counts_dict[entry] = 1

# Print the populated dictionary
print(counts_dict)


{'en': 97, 'et': 1, 'und': 2}
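
The counting step in the inner loop can also be handled by `collections.Counter`, which accumulates counts across chunks without the explicit `if`/`else`. The sketch below uses hard-coded lists as hypothetical stand-ins for the `lang` column of successive chunks, so it runs without the CSV file:

```python
from collections import Counter

# Hypothetical stand-ins for chunk["lang"] in three successive chunks
chunks = [["en", "en", "et"], ["und", "en"], ["en", "und"]]

counts = Counter()
for chunk in chunks:
    counts.update(chunk)  # add the occurrences found in this chunk

print(dict(counts))
```

With the real data, each `chunk["lang"]` from `pd.read_csv(..., chunksize=10)` could be passed to `counts.update()` in the same way.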


## A list comprehension

In [5]:
nums = [12,4,56,7,83]

# List comprehension: [output expression for item in iterable]
new_nums = [num+1 for num in nums] 

print(new_nums)

[13, 5, 57, 8, 84]
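
For comparison, this is the explicit `for` loop that the comprehension replaces; both build the same list (`loop_result` and `comp_result` are names chosen for this sketch):

```python
nums = [12, 4, 56, 7, 83]

# Explicit loop: start with an empty list and append one value per pass
loop_result = []
for num in nums:
    loop_result.append(num + 1)

# Equivalent list comprehension
comp_result = [num + 1 for num in nums]

print(loop_result == comp_result)
```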


## List comprehension for "nested loop"

In [8]:
# Nested loops: [output expression for outer item in iterable for inner item in iterable]
new_nums = [(num1,num2) for num1 in range(0,2) for num2 in range(6,8)]
print(new_nums)

[(0, 6), (0, 7), (1, 6), (1, 7)]
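
The order of the `for` clauses in the comprehension mirrors the order of nested loops: the first clause is the outer loop and the second is the inner loop. A sketch of the equivalent nested loops (the name `pairs` is illustrative):

```python
pairs = []
for num1 in range(0, 2):        # outer loop: first for clause
    for num2 in range(6, 8):    # inner loop: second for clause
        pairs.append((num1, num2))

print(pairs)
```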


# Advanced Comprehensions

## Conditionals on the iterable

In [3]:
print([num**2 for num in range(0,10) if num%2 == 0]) 
#output expression -> num**2
#iterable -> for num in range(0,10) if num%2 == 0

[0, 4, 16, 36, 64]


## Conditionals on the output expression

In [4]:
#output expression -> num**2 if num%2==0 else 0
#iterable -> for num in range(0,10)
[num**2 if num%2==0 else 0 for num in range(0,10)]

[0, 0, 4, 0, 16, 0, 36, 0, 64, 0]
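
The two forms behave differently: a condition on the iterable filters items out, so the result can be shorter than the input, while a conditional output expression keeps one output per input. A side-by-side sketch (`filtered` and `replaced` are names chosen here):

```python
# Condition after the for clause: odd numbers are skipped entirely
filtered = [num**2 for num in range(10) if num % 2 == 0]

# Conditional expression before the for clause: odd numbers become 0
replaced = [num**2 if num % 2 == 0 else 0 for num in range(10)]

print(len(filtered), filtered)
print(len(replaced), replaced)
```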

# Generator Expression

- Create a generator object that will produce values from 0 to 30. Assign the result to result and use num as the iterator variable in the generator expression.
- Print the first 5 values by using next() appropriately in print().
- Print the rest of the values by using a for loop to iterate over the generator object.

In [5]:
# Create generator object: result
result = (num for num in range(0,31))
print(type(result))

# Print the first 5 values
print(next(result))
print(next(result))
print(next(result))
print(next(result))
print(next(result))

# Print the rest of the values
for value in result:
    print(value)


<class 'generator'>
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
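
Because a generator produces its values one at a time, it can be passed directly to functions such as `sum()` without building a list first, and it can only be consumed once. A minimal sketch of both points (`total` and `leftover` are illustrative names):

```python
result = (num for num in range(0, 31))

# sum() pulls the values from the generator one at a time
total = sum(result)

# The generator is now exhausted, so nothing remains
leftover = list(result)

print(total)
print(leftover)
```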


## Generator Functions

**Generator functions** are functions that, like generator expressions, yield a series of values instead of returning a single value. A generator function is defined like a regular function, but whenever it generates a value, it uses the keyword yield instead of return.

- Complete the function header for the function get_lengths() that has a single parameter, input_list.
- In the for loop in the function definition, yield the length of the strings in input_list.
- Complete the iterable part of the for loop for printing the values generated by the get_lengths() generator function. Supply the call to get_lengths(), passing in the list lannister.

In [6]:
# Create a list of strings
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Define generator function get_lengths
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""

    # Yield the length of a string
    for person in input_list:
        yield len(person)

print(get_lengths(lannister))

# Print the values generated by get_lengths()
for value in get_lengths(lannister):
    print(value)

<generator object get_lengths at 0x7f6ad8db53c0>
6
5
5
6
7
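
A generator function pauses at each `yield` and resumes from that point on the next request, so values can be taken one at a time with `next()` and the remainder collected afterwards. A sketch reusing `get_lengths()` from above (the names `gen`, `first`, and `rest` are illustrative):

```python
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""
    for person in input_list:
        yield len(person)

lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

gen = get_lengths(lannister)

first = next(gen)  # runs the function body up to the first yield
rest = list(gen)   # resumes after that yield and collects the rest

print(first)
print(rest)
```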
