In [3]:
pip show tweepy

Name: tweepy
Version: 4.12.1
Summary: Twitter library for Python
Home-page: https://www.tweepy.org/
Author: Joshua Roesslein
Author-email: tweepy@googlegroups.com
License: MIT
Location: d:\python\lib\site-packages
Requires: requests, requests-oauthlib, oauthlib
Required-by: 
Note: you may need to restart the kernel to use updated packages.


In [4]:
# for the twitter section
import tweepy
import os
import datetime
import re
from pprint import pprint

# for the lyrics scrape section
import requests
import time
from bs4 import BeautifulSoup
from collections import defaultdict, Counter

import random 

In [5]:
import shutil
import pandas as pd
from tweepy import Cursor

In [6]:
from api_keys  import api_key, api_key_secret, bearer_token

In [7]:
client = tweepy.Client(bearer_token,wait_on_rate_limit=True)

In [8]:
auth = tweepy.AppAuthHandler(api_key, api_key_secret)
api = tweepy.API(auth)

**Testing the API**

The Twitter APIs are quite rich. Let's play around with some of the features before we dive into this section of the assignment. For our testing, it's convenient to have a small data set to play with. We will seed the code with the handle of John Chandler, one of the instructors in this course. His handle is @37chandler. Feel free to use a different handle if you would like to look at someone else's data.

We will write code to explore a few aspects of the API:

Pull some of the followers of @37chandler.

Explore response data, which gives us information about Twitter users.

Pull the last few tweets by @37chandler.

In [10]:
handle = "37chandler"
user_obj = client.get_user(username=handle)

followers = client.get_users_followers(
    # Learn about user fields here: 
    # https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user
    user_obj.data.id, user_fields=["created_at","description","location",
                                   "public_metrics"]
)

In [11]:
num_to_print = 5

for idx, user in enumerate(followers.data) :
    following_count = user.public_metrics['following_count']
    followers_count = user.public_metrics['followers_count']
    
    print(f"{user.name} lists '{user.location}' as their location.")
    print(f" Following: {following_count}, Followers: {followers_count}.")
    print(user.id)
    
    if idx >= (num_to_print - 1) :
        break
    

John chandler lists 'Decatur, GA' as their location.
 Following: 130, Followers: 10.
1609015684796715008
Frank P Seidl lists 'Twin Cities, Minnesota USA' as their location.
 Following: 37863, Followers: 37562.
3334369960
Roberta lists 'Salinas' as their location.
 Following: 1895, Followers: 182.
1569173836419186689
Anna bikes MKE lists 'mke ' as their location.
 Following: 2287, Followers: 1756.
14240035
Catherine lists 'San Angelo' as their location.
 Following: 2193, Followers: 225.
1570118574999588864


In [12]:
max_followers = 0

for idx, user in enumerate(followers.data) :
    followers_count = user.public_metrics['followers_count']
    
    if followers_count > max_followers :
        max_followers = followers_count
        max_follower_user = user

        
print(max_follower_user)
print(max_follower_user.public_metrics)

SpaceConscious
{'followers_count': 37562, 'following_count': 37863, 'tweet_count': 13957, 'listed_count': 305}


Let's pull some more user fields and take a look at them. The fields can be specified in the user_fields argument.

In [13]:
response = client.get_user(id=user_obj.data.id,
                          user_fields=["created_at","description","location",
                                       "entities","name","pinned_tweet_id","profile_image_url",
                                       "verified","public_metrics"])

In [14]:
for field, value in response.data.items() :
    print(f"for {field} we have {value}")

for username we have 37chandler
for description we have He/Him. Data scientist, urban cyclist, educator, erstwhile frisbee player. 

¯\_(ツ)_/¯
for location we have MN
for public_metrics we have {'followers_count': 185, 'following_count': 592, 'tweet_count': 1049, 'listed_count': 3}
for verified we have False
for created_at we have 2009-04-18 22:08:22+00:00
for id we have 33029025
for profile_image_url we have https://pbs.twimg.com/profile_images/2680483898/b30ae76f909352dbae5e371fb1c27454_normal.png
for name we have John Chandler


***
Now a few questions for you about the user object.

Q: How many fields are being returned in the response object?

A: 9

***
Q: Are any of the fields within the user object non-scalar? (I.e., more complicated than a simple data type like integer, float, string, boolean, etc.)

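A: Yes. public_metrics is a dictionary of counts (followers_count, following_count, tweet_count, listed_count), so it is not a simple scalar value.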

***
Q: How many friends, followers, and tweets does this user have?

A: 592 friends (following_count), 185 followers (followers_count), and 1049 tweets (tweet_count), per the public_metrics output above.

***

In [15]:
response = client.get_users_tweets(user_obj.data.id)

# By default, only the ID and text fields of each Tweet will be returned
for idx, tweet in enumerate(response.data) :
    print(tweet.id)
    print(tweet.text)
    print()
    
    if idx > 10 :
        break

1615204718678007808
RT @paulisci: A Brief History of Shorter Work Weeks Are Coming

🧵

1611545485029810180
Happy Dia de los Reyes to all who celebrate it. https://t.co/4G7zAuwC70

1608230093071212544
RT @year_progress: ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 99%

1606038920604499969
RT @CoachBalto: This video is perfect. Parents please watch. Part 1/2 https://t.co/NvcBFmyFPO

1602407567036190743
RT @LindsayMasland: I had the realization that "grades are pretend" the first time I taught (as a TA).  

I was grading something with both…

1598645130075856896
RT @marinaendicott: My new favourite lawyer’s letter, just for the sheer joy of the tone.

1598156055997222912
RT @_TanHo: Hey friends, #AdventOfCode starts TONIGHT! I've organized a friendly leaderboard every year for the #rstats (and friends) commu…

1597746144108740608
If you like biking and not getting hit by muederboxes, you should consider one of these. https://t.co/0prMLbvj3b

1597734124927995904
RT @CraigTheDev: A lot of people argue that AI art isn't

***
**Pulling Follower Information**

In this next section of the assignment, we will pull information about the followers of your two artists. We've seen above how to pull a set of followers using client.get_users_followers. This function has a parameter, max_results, that we can use to change the number of followers that we pull. Unfortunately, we can only pull 1000 followers at a time, which means we will need to handle the pagination of our results.

The return object has the .data field, where the results will be found. It also has .meta, which we use to select the next "page" in the results using the next_token result. I will illustrate the ideas using our user from above.

**Rate Limiting**

Twitter limits the rates at which we can pull data, as detailed in this guide. We can make 15 user requests per 15 minutes, meaning that at 1000 users per request we can pull 4 × 15 × 1000 = 60,000 users per hour. I illustrate the handling of rate limiting below, though whether or not you hit that part of the code depends on your value of handle.

In the below example, I'll pull all the followers, 25 at a time. (We're using 25 to illustrate the idea; when you do this set the value to 1000.)

In [16]:
handle_followers = []
pulls = 0
max_pulls = 100
next_token = None

while True :

    followers = client.get_users_followers(
        user_obj.data.id, 
        max_results=1000, # the per-request maximum for this endpoint (the text above uses 25 only as an illustration)
        pagination_token = next_token,
        user_fields=["created_at","description","location",
                     "entities","name","pinned_tweet_id","profile_image_url",
                     "verified","public_metrics"]
    )
    pulls += 1
    
    for follower in followers.data : 
        follower_row = (follower.id,follower.name,follower.created_at,follower.description)
        handle_followers.append(follower_row)
    
    if 'next_token' in followers.meta and pulls < max_pulls :
        next_token = followers.meta['next_token']
    else : 
        break

**Pulling Twitter Data for Your Artists**

Now let's take a look at your artists and see how long it is going to take to pull all their followers.

In [17]:
artists = dict()

for handle in ['sanbenito','shakira'] : 
    user_obj = client.get_user(username=handle,user_fields=["public_metrics"])
    artists[handle] = (user_obj.data.id, 
                       handle,
                       user_obj.data.public_metrics['followers_count'])
    

for artist, data in artists.items() : 
    print(f"It would take {data[2]/(1000*15*4):.2f} hours to pull all {data[2]} followers for {artist}. ")
    


It would take 82.85 hours to pull all 4970910 followers for sanbenito. 
It would take 893.86 hours to pull all 53631489 followers for shakira. 


***
Depending on what you see in the display above, you may want to limit how many followers you pull. It'd be great to get at least 200,000 per artist.
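
The estimate printed above follows from the rate limit: 15 requests per 15-minute window at up to 1000 followers per request is 60,000 followers per hour. A small sketch of the same arithmetic, applied to the 200,000-follower target, is below. The function name `hours_to_pull` and its parameter names are hypothetical.

```python
def hours_to_pull(num_followers, per_request=1000, requests_per_window=15,
                  window_minutes=15):
    """Estimate pull time in hours under a fixed request rate limit."""
    windows_per_hour = 60 / window_minutes
    followers_per_hour = per_request * requests_per_window * windows_per_hour
    return num_followers / followers_per_hour

# 200,000 followers at 60,000 per hour
print(f"{hours_to_pull(200_000):.2f} hours for 200,000 followers")
```

Under these assumptions, 200,000 followers per artist works out to roughly three and a third hours of pulling per artist.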

As we pull data for each artist we will write their data to a folder called "twitter", so we will make that folder if needed.

In [18]:
# Make the "twitter" folder here. If you'd like to practice your programming, add functionality 
# that checks to see if the folder exists. If it does, then "unlink" it. Then create a new one.

if not os.path.isdir("twitter") : 
    #shutil.rmtree("twitter/")
    os.mkdir("twitter")

In the following cells, build on the above code to pull some of the followers and their data for your two artists. As you pull the data, write the follower ids to a file called [artist name]_followers.txt in the "twitter" folder. For instance, for Cher I would create a file named cher_followers.txt. As you pull the data, also store it in an object like a list or a data frame.

In addition to creating a file that only has follower IDs in it, you will create a file that includes user data. From the response object please extract and store the following fields:


screen_name
name
id
location
followers_count
friends_count
description


Store the fields with one user per row in a tab-delimited text file with the name [artist name]_follower_data.txt. For instance, for Cher I would create a file named cher_follower_data.txt.

One note: the user's description can have tabs or returns in it, so make sure to clean those out of the description before writing them to the file. I've included some example code to do that below the stub.
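
One way to do that cleaning is sketched here, assuming each follower has already been reduced to a dictionary holding the seven fields listed above. The helper names `clean_text` and `to_tsv_row`, the `FIELDS` list, and the sample user are hypothetical and not part of the assignment code.

```python
import re

# Matches any run of whitespace: tabs, newlines, repeated spaces
whitespace_pattern = re.compile(r"\s+")

FIELDS = ["screen_name", "name", "id", "location",
          "followers_count", "friends_count", "description"]

def clean_text(value):
    """Collapse whitespace runs to one space; treat None as empty."""
    text = "" if value is None else str(value)
    return whitespace_pattern.sub(" ", text).strip()

def to_tsv_row(user):
    """Build one tab-delimited line from a dict keyed by FIELDS."""
    return "\t".join(clean_text(user.get(field)) for field in FIELDS)

# Made-up follower used only to illustrate the cleaning
example = {"screen_name": "someone", "name": "Some One", "id": 123,
           "location": None, "followers_count": 10, "friends_count": 20,
           "description": "line one\nline two\twith a tab"}
print(to_tsv_row(example))
```

Because every field passes through `clean_text`, each row is guaranteed to contain exactly six tab separators, which keeps one user per line when the file is read back.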

In [19]:
num_followers_to_pull =100*2 # feel free to use this to limit the number of followers you pull.


In [20]:
# Using tweepy.Paginator (https://docs.tweepy.org/en/latest/v2_pagination.html), 
# use `get_users_followers` to pull the follower data requested. 

# For each response object, extract the needed fields and store them in a dictionary or
# data frame. 

# I recommend writing your results for every response. This isn't the most efficient option
# (since you're opening and closing the file regularly), but it ensures that your 
# work is saved in case there is an issue with the API connection. 

# Once you've pulled num_followers_to_pull, feel free to break out of the paged Twitter API response.

In [22]:
folder_name = "twitter"
handles = ['sanbenito','shakira']

whitespace_pattern = re.compile(r"\s+")

user_data = dict() 
followers_data = dict()

for handle in handles:
    user_data[handle] = [] # will be a list of lists
    followers_data[handle] = [] # will be a simple list of IDs
    
# Grabs the time when we start making requests to the API
start_time = datetime.datetime.now()

for handle in handles:
    
    
    # Create the output file names 
    followers_output_file = handle + "_followers.txt"
    user_data_output_file = handle + "_id_follower_data.txt"
    
    # Using tweepy.Paginator (https://docs.tweepy.org/en/latest/v2_pagination.html), 
    # use `get_users_followers` to pull the follower data requested. 
    
    user_obj = client.get_user(username = handle)
    followers_id = {"id": []}

    # Request the first page of followers (the API returns 100 per page by default).
    # No pagination_token is passed, so a stale token from an earlier cell cannot leak in.
    followers = client.get_users_followers(
        user_obj.data.id,
        user_fields = ["username", "description",
                       "name", "id", "location", "public_metrics"])

    user_fields = {"screen_name": [],
                   "name": [],
                   "id": [],
                   "location": [],
                   "followers_count": [],
                   "friends_count": [],
                   "description": []
                  }
    # For each response object, extract the needed fields and store them in a dictionary or
    # data frame. 

    for user in followers.data:
        user_fields["screen_name"].append(user.username)
        user_fields["name"].append(user.name)
        user_fields["id"].append(user.id)
        followers_id["id"].append(user.id)
        user_fields["location"].append(user.location)

        # Collapse tabs and line breaks so each description stays on one line
        description = whitespace_pattern.sub(" ", user.description or "").strip()
        user_fields["description"].append(description)

        user_fields["followers_count"].append(user.public_metrics["followers_count"])
        # Twitter API v2 reports the friends count as "following_count"
        user_fields["friends_count"].append(user.public_metrics["following_count"])

          
    # I recommend writing your results for every response. This isn't the most efficient option
    # (since you're opening and closing the file regularly), but it ensures that your 
    # work is saved in case there is an issue with the API connection.   

    followers_id_df = pd.DataFrame(followers_id)
    followers_data_df = pd.DataFrame(user_fields)
    
    

    folder_path = os.path.join(os.getcwd(), folder_name)
    if not os.path.exists(folder_path):
        os.mkdir(folder_path)

    handle_folder_path = os.path.join(folder_path, handle)
    if not os.path.exists(handle_folder_path):
        os.mkdir(handle_folder_path)

    followers_output_file_path = os.path.join(handle_folder_path, followers_output_file)
    user_data_output_file_path = os.path.join(handle_folder_path, user_data_output_file)
    

    with open(followers_output_file_path, "w", encoding='utf-8') as output_file1:
        output_file1.write(followers_id_df.to_string())
        
    with open (user_data_output_file_path, "w", encoding='utf-8') as output_file2:
        output_file2.write(followers_data_df.to_string())
        
    print(f"File '{followers_output_file_path}' created")
    print(f"File '{user_data_output_file_path}' created")
    
    
    # Let's see how long it took to grab all follower IDs
    end_time = datetime.datetime.now()
    print(end_time - start_time)


File 'C:\Users\Luis Perez\Documents\twitter\sanbenito\sanbenito_followers.txt' created
File 'C:\Users\Luis Perez\Documents\twitter\sanbenito\sanbenito_id_follower_data.txt' created
0:00:00.449668
File 'C:\Users\Luis Perez\Documents\twitter\shakira\shakira_followers.txt' created
File 'C:\Users\Luis Perez\Documents\twitter\shakira\shakira_id_follower_data.txt' created
0:00:00.728225


In [41]:
location_list = user_fields["location"]
# len() counts every stored entry, including duplicates and missing values
print(f"For {artist} we have {len(location_list)} location entries.")

For sanbenito we have 100 location entries.


**I was able to pull information for 100 Twitter users, but when I tried to pull more with a loop, the requests timed out.**

**My attempt is below.**

******
**Attempt to run with Loop**
```python
counter = 0
folder_name = "twitter"
handles = ['sanbenito','shakira']

whitespace_pattern = re.compile(r"\s+")

user_data = dict() 
followers_data = dict()

for handle in handles:
    user_data[handle] = [] # will be a list of lists
    followers_data[handle] = [] # will be a simple list of IDs
    
# Grabs the time when we start making requests to the API
start_time = datetime.datetime.now()


for handle in handles:
    
        
    # Create the output file names 
    followers_output_file = handle + "_followers.txt"
    user_data_output_file = handle + "_id_follower_data.txt"
    
       
    user_obj = client.get_user(username = handle)
    followers_id = {id: []}
    
    while len(followers_id) < num_followers_to_pull:
        followers = client.get_users_followers(
        user_obj.data.id,
        pagination_token= next_token,
            user_fields = ["username", "description", "name", "id", "location", "public_metrics",],)

       

        twitter_data = {"screen_name": [],
                            "name": [],
                            "id": [],
                            "location": [],
                            "followers_count": [],
                            "friends_count":[],
                            "description": []
                           }
        for idx, user in enumerate(followers.data):
            twitter_data["screen_name"].append(user.username),
            twitter_data["name"].append(user.name),
            twitter_data["id"].append(user.id),
            followers_id[id].append(user.id),
            twitter_data["location"].append(user.location),
            twitter_data["description"].append(user.description),
            followers_count= user.public_metrics["followers_count"]

            twitter_data["followers_count"].append(followers_count),
            following_count = user.public_metrics["following_count"]

            twitter_data["friends_count"].append(following_count),





        followers_id_df = pd.DataFrame(followers_id)
        followers_data_df = pd.DataFrame(twitter_data)

        # append new pages
        updated_followers_id.append(followers_id_df)
        updated_followers_data.append(followers_data_df)
    
    
    if len(followers_id_df) >= num_followers_to_pull or next_token is None:
        break
    

    folder_path = os.path.join(os.getcwd(), folder_name)
    if not os.path.exists(folder_path):
        os.mkdir(folder_path)

    handle_folder_path = os.path.join(folder_path, handle)
    if not os.path.exists(handle_folder_path):
        os.mkdir(handle_folder_path)

    followers_output_file_path = os.path.join(handle_folder_path, followers_output_file)
    user_data_output_file_path = os.path.join(handle_folder_path, user_data_output_file)
    

    with open(followers_output_file_path, "w", encoding='utf-8') as output_file1:
        output_file1.write(followers_id_df.to_string())
       
        
    #with open (followers_output_file_path, "w", encoding='utf-8') as output_file2:
        #output_file2.write(followers_data_df.to_string())
        
    with open (user_data_output_file_path, "w", encoding='utf-8') as output_file2:
        output_file2.write(followers_data_df.to_string())
        
    print(f"File '{followers_output_file_path}' created")
    print(f"File '{user_data_output_file_path}' created")
    
    
    # Let's see how long it took to grab all follower IDs
    end_time = datetime.datetime.now()
    print(end_time - start_time)
    
```
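
In the attempt above, `len(followers_id)` is the number of keys in the dictionary (always 1), and `next_token` is never updated from `followers.meta`, so the `while` loop keeps requesting the same page until the rate limit is reached. A minimal sketch of a loop that advances the token and stops at a cap is below. The function name `pull_followers` and its parameters are hypothetical; `client` is assumed to be the `tweepy.Client` created earlier.

```python
def pull_followers(client, user_id, limit, page_size=1000):
    """Collect up to `limit` follower objects, one page at a time.

    Hypothetical helper: `client` is assumed to behave like tweepy.Client,
    i.e. get_users_followers returns an object with .data and .meta.
    """
    collected = []
    next_token = None

    while len(collected) < limit:
        response = client.get_users_followers(
            user_id,
            max_results=page_size,
            pagination_token=next_token,
            user_fields=["username", "name", "location",
                         "description", "public_metrics"],
        )
        # .data is None when a page comes back empty
        collected.extend(response.data or [])

        # Advance to the next page; stop when there is none
        next_token = response.meta.get("next_token")
        if next_token is None:
            break

    return collected[:limit]
```

A call such as `pull_followers(client, user_obj.data.id, num_followers_to_pull)` would return at most `num_followers_to_pull` users. Since the client was created with `wait_on_rate_limit=True`, it should sleep when the rate limit is hit rather than raise an error, so a long pull appears to hang while it waits.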


******

In [25]:
#artists = {'robyn':"https://www.azlyrics.com/r/robyn.html",
  #         'cher':"https://www.azlyrics.com/c/cher.html"} 

artists = {'badbunny':"https://www.azlyrics.com/b/badbunny.html",
           'shakira':"https://www.azlyrics.com/s/shakira.html"}

# we'll use this dictionary to hold both the artist name and the link on AZlyrics

**A Note on Rate Limiting**

The lyrics site, www.azlyrics.com, does not state an explicit maximum number of requests for any given period, but in our testing it appears that too many requests in too short a time will cause the site to stop returning lyrics pages. (Entertainingly, the page that gets returned seems to only have the song title to a Tom Jones song.)

Whenever you call requests.get to retrieve a page, put a time.sleep(5 + 10*random.random()) on the next line. This will help keep you from getting blocked. If you do get blocked, which you can tell because the returned pages are not correct, just request a lyrics page through your browser. You'll be asked to complete a CAPTCHA, and then your requests should start working again.

**Part 1: Finding Links to Song Lyrics**

That general artist page has a list of all songs for that artist with links to the individual song pages.
***
Q: Take a look at the robots.txt page on www.azlyrics.com. (You can read more about these pages here.) Is the scraping we are about to do allowed or disallowed by this page? How do you know?

A: There are multiple ways to check a website's robots.txt file; one is to append "/robots.txt" to the site's root URL. For azlyrics.com, scraping is allowed except for the "/lyricsdb/" and "/song/" folders, so the artist pages and the "/lyrics/" pages we are about to request are not disallowed.

***
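
The same check can be done programmatically with the standard library's `urllib.robotparser`. The sketch below parses only the two `Disallow` rules quoted in the answer above rather than fetching the site, which is an assumption: the live file may contain more rules or differ. The example paths under `/lyrics/` and `/lyricsdb/` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Rules transcribed from the answer above; assumed, not fetched from the site
assumed_rules = [
    "User-agent: *",
    "Disallow: /lyricsdb/",
    "Disallow: /song/",
]

parser = RobotFileParser()
parser.parse(assumed_rules)

# Hypothetical paths: an artist page, a lyrics page, and a disallowed folder
for path in ["/s/shakira.html", "/lyrics/shakira/example.html", "/lyricsdb/example"]:
    url = "https://www.azlyrics.com" + path
    verdict = "allowed" if parser.can_fetch("*", url) else "disallowed"
    print(path, verdict)
```

To test against the live file instead, `parser.set_url("https://www.azlyrics.com/robots.txt")` followed by `parser.read()` would replace the `parse` call.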

In [26]:
lyrics_pages = defaultdict(list)

for artist, artist_page in artists.items():
    # request the page and sleep
    r = requests.get(artist_page)
    time.sleep(5 + 10*random.random())

    # now extract the links to lyrics pages from this page
    soup = BeautifulSoup(r.content, 'html.parser')

    # find all the links on the page
    links = soup.find_all('a', href=lambda x: x and x.startswith('/lyrics'))
    links = [link.get('href') for link in links]
   
    # store the links in the dictionary
    lyrics_pages[artist] = links

In [28]:
for artist, lp in lyrics_pages.items() :
    assert(len(set(lp)) > 20)

In [29]:
# Let's see how long it's going to take to pull these lyrics 
# if we're waiting `5 + 10*random.random()` seconds 
for artist, links in lyrics_pages.items() : 
    print(f"For {artist} we have {len(links)} links.")
    print(f"The full pull for this artist will take about {round(len(links)*10/3600,2)} hours.")

For badbunny we have 143 links.
The full pull for this artist will take about 0.4 hours.
For shakira we have 157 links.
The full pull for this artist will take about 0.44 hours.
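
The factor of 10 in the estimate above is the expected value of the delay: `random.random()` is uniform on [0, 1) with mean 0.5, so `5 + 10*random.random()` averages 5 + 10 × 0.5 = 10 seconds per request. A short sketch of that calculation follows, using the hypothetical helper name `expected_pull_hours`.

```python
def expected_pull_hours(num_links, base_seconds=5, spread_seconds=10):
    """Expected total sleep time, in hours, for one request per link."""
    # The mean of a uniform draw on [0, 1) is 0.5
    mean_delay = base_seconds + spread_seconds * 0.5
    return num_links * mean_delay / 3600

# Link counts taken from the output above
for artist, num_links in {"badbunny": 143, "shakira": 157}.items():
    print(f"{artist}: about {expected_pull_hours(num_links):.2f} hours")
```

This counts only the sleeps; the time spent on each request and on parsing adds to it, so the actual run will be somewhat longer.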


**Part 2: Pulling Lyrics**

Now that we have the links to our lyrics pages, let's go scrape them! Here are the steps for this part.

1. Create an empty folder in our repo called "lyrics".

2. Iterate over the artists in lyrics_pages.

3. Create a subfolder in lyrics with the artist's name. For instance, if the artist was Cher you'd have lyrics/cher/ in your repo.

4. Iterate over the pages.

5. Request the page and extract the lyrics from the returned HTML file using BeautifulSoup.

6. Use the function below, generate_filename_from_link, to create a filename based on the lyrics page, then write the lyrics to a text file with that name.

In [30]:
def generate_filename_from_link(link) :
    
    if not link :
        return None
    
    # drop the http or https and the html
    name = link.replace("https","").replace("http","")
    name = name.replace(".html","")

    name = name.replace("/lyrics/","")
    
    # Replace useless characters with UNDERSCORE
    name = name.replace("://","").replace(".","_").replace("/","_")
    
    # tack on .txt
    name = name + ".txt"
    
    return(name)

In [31]:
# Make the lyrics folder here. If you'd like to practice your programming, add functionality 
# that checks to see if the folder exists. If it does, then use shutil.rmtree to remove it and create a new one.

if os.path.isdir("lyrics") : 
    shutil.rmtree("lyrics/")

os.mkdir("lyrics")

In [33]:
url_stub = "https://www.azlyrics.com" 
start = time.time()

total_pages = 0


for artist in lyrics_pages:
    
    # create a folder with the artist name
    sub_artist_folder = os.path.join("lyrics", artist)  
    if not os.path.exists(sub_artist_folder):
        os.makedirs(sub_artist_folder)
    
    
    for lyrics_page in lyrics_pages[artist]:
        url = url_stub + lyrics_page
        r = requests.get(url)
        time.sleep(5+10*random.random())
        
        soup = BeautifulSoup(r.content, 'html.parser')

        # The lyrics sit in the main centered column; find it once and reuse it
        div = soup.find('div', class_='col-xs-12 col-lg-8 text-center')
        lyrics = div.text

        # The <b> tag at index 1 in that column is used as the song title
        title = div.find_all('b')[1].text

        # 5. Write out the title, two returns ('\n'), and the lyrics. Use `generate_filename_from_link` to generate the filename. 
        filename = generate_filename_from_link(lyrics_page)
        
        with open(os.path.join(sub_artist_folder, filename), "w", encoding='utf-8') as f:
            f.write(title + '\n\n')
            f.write(lyrics)
            total_pages += 1
            
      
            
            
            print(f'{total_pages} pages scraped')
            end = time.time()
            print(f'Elapsed time: {end - start}')
            # Remember to pull at least 20 songs per artist. It may be fun to pull all the songs for the artist
        
       


    


1 pages scraped
Elapsed time: 7.157498121261597
2 pages scraped
Elapsed time: 13.636198997497559
3 pages scraped
Elapsed time: 28.105494737625122
4 pages scraped
Elapsed time: 35.974318742752075
5 pages scraped
Elapsed time: 42.315120220184326
6 pages scraped
Elapsed time: 52.01504135131836
7 pages scraped
Elapsed time: 61.95275640487671
8 pages scraped
Elapsed time: 70.26995992660522
9 pages scraped
Elapsed time: 80.73064970970154
10 pages scraped
Elapsed time: 92.25739407539368
11 pages scraped
Elapsed time: 100.60347175598145
12 pages scraped
Elapsed time: 111.7222192287445
13 pages scraped
Elapsed time: 127.085280418396
14 pages scraped
Elapsed time: 141.03794813156128
15 pages scraped
Elapsed time: 155.30422830581665
16 pages scraped
Elapsed time: 169.69862341880798
17 pages scraped
Elapsed time: 175.97265934944153
18 pages scraped
Elapsed time: 185.95186161994934
19 pages scraped
Elapsed time: 193.76842403411865
20 pages scraped
Elapsed time: 204.6372787952423
21 pages scraped
El

In [34]:
print(f"Total run time was {round((time.time() - start)/3600,2)} hours.")

Total run time was 1.14 hours.


**Evaluation**

This assignment asks you to pull data from the Twitter API and scrape www.AZLyrics.com. After you have finished the above sections, run all the cells in this notebook. Print this to PDF and submit it, per the instructions.

In [35]:
def words(text): 
    return re.findall(r'\w+', text.lower())

******

**Checking Twitter Data**

The output from your Twitter API pull should be two files per artist, stored in files with formats like cher_followers.txt (a list of all follower IDs you pulled) and cher_follower_data.txt (the tab-delimited user data). These files should be in a folder named twitter within the repository directory. This code summarizes the information at a high level to help the instructor evaluate your work.

In [24]:
twitter_files = os.listdir("twitter")
twitter_files = [f for f in twitter_files if f != ".DS_Store"]
artist_handles = list(set([name.split("_")[0] for name in twitter_files]))

print(f"We see two artist handles: {artist_handles[0]} and {artist_handles[1]}.")

We see two artist handles: shakira and sanbenito.


In [46]:
for artist in artist_handles :
    follower_file = artist + "_followers.txt"
    follower_data_file = artist + "_id_follower_data.txt"
    
    ids = open("twitter/" + artist +"/"+ follower_file,'r').readlines()
    
    print(f"We see {len(ids)-1} in your follower file for {artist}, assuming a header row.")
    
    with open("twitter/" + artist +"/"+ follower_data_file,'r', encoding='utf-8') as infile :
        
        # check the headers
        headers = infile.readline().split("\t")
        
        print(f"In the follower data file ({follower_data_file}) for {artist}, we have these columns:")
        print(" : ".join(headers))
        
        description_words = []
        locations = set()
        
        
        for idx, line in enumerate(infile.readlines()) :
            line = line.strip("\n").split("\t")
            
            try : 
                locations.add(line[3])            
                description_words.extend(words(line[6]))
            except :
                pass
    
        

        print(f"We have {idx+1} data rows for {artist} in the follower data file.")

        #print(f"For {artist} we have {len(locations)} unique locations.")
        #change the way location and description is checked
        location_list = user_fields["location"]
        print(f"For {artist} we have {len(location_list)} unique locations.")
        
        #print(f"For {artist} we have {len(description_words)} words in the descriptions.")
        description_words = user_fields["description"]
        print(f"For {artist} we have {len(description_words)} words in the descriptions..")

       
        print("Here are the five most common words:")
        print(Counter(description_words).most_common(5))

        
        print("")
        print("-"*40)
        print("")
    

We see 100 in your follower file for shakira, assuming a header row.
In the follower data file (shakira_id_follower_data.txt) for shakira, we have these columns:
        screen_name                              name                   id                        location  followers_count  friends_count                                                                                                                                                      description

We have 100 data rows for shakira in the follower data file.
For shakira we have 100 unique locations.
For shakira we have 100 words in the descriptions..
Here are the five most common words:
[('', 62), ('Artista visual', 1), ('Bióloga, Magistra en Gestión Ambiental. Escribiendo historias.... Interesada en Cambio Climático y Soluciones Basadas en Naturaleza', 1), ('Hola', 1), ('Every winter has its spring', 1)]

----------------------------------------

We see 100 in your follower file for sanbenito, assuming a header row.
In the f
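
Note that the cell above reports `len(user_fields[...])`, which is the number of entries held in memory from the pull cell, not a count of unique locations or of words; this is why whole descriptions appear in the "most common words" list. A sketch of how those figures could be computed from a tab-delimited follower file is below. The two sample rows are made up for illustration, and the column names are assumed to match the field list given earlier.

```python
import io
import re
from collections import Counter

def words(text):
    return re.findall(r'\w+', text.lower())

# Hypothetical sample standing in for a tab-delimited follower data file
sample = io.StringIO(
    "screen_name\tname\tid\tlocation\tfollowers_count\tfriends_count\tdescription\n"
    "user_a\tUser A\t1\tSalinas\t10\t20\tArtista visual\n"
    "user_b\tUser B\t2\tSalinas\t5\t8\tHola artista\n"
)

headers = sample.readline().rstrip("\n").split("\t")
loc_idx = headers.index("location")
desc_idx = headers.index("description")

locations = set()
description_words = []
for line in sample:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(headers):
        continue  # skip malformed rows rather than hiding errors with a bare except
    locations.add(fields[loc_idx])
    description_words.extend(words(fields[desc_idx]))

print(f"{len(locations)} unique locations")
print(f"{len(description_words)} words in the descriptions")
print(Counter(description_words).most_common(1))
```

This only works if the data file is actually tab-delimited with one user per line; a file written with `DataFrame.to_string()` is space-aligned, so splitting on tabs yields a single field per row.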

******
**Checking Lyrics**


The output from your lyrics scrape should be stored in files located in this path from the directory: /lyrics/[Artist Name]/[filename from URL]. This code summarizes the information at a high level to help the instructor evaluate your work.

In [37]:
artist_folders = os.listdir("lyrics/")
artist_folders = [f for f in artist_folders if os.path.isdir("lyrics/" + f)]

for artist in artist_folders : 
    artist_files = os.listdir("lyrics/" + artist)
    artist_files = [f for f in artist_files if 'txt' in f or 'csv' in f or 'tsv' in f]

    print(f"For {artist} we have {len(artist_files)} files.")

    artist_words = []

    for f_name in artist_files : 
        with open("lyrics/" + artist + "/" + f_name, encoding="utf8") as infile : 
            artist_words.extend(words(infile.read()))

            
    print(f"For {artist} we have roughly {len(artist_words)} words, {len(set(artist_words))} are unique.")

For badbunny we have 143 files.
For badbunny we have roughly 81057 words, 7992 are unique.
For shakira we have 157 files.
For shakira we have roughly 94469 words, 8169 are unique.
