# 1. Data Collection

**Data collection** is the process of gathering data from different sources for analysis and processing. In machine learning and natural language processing, collecting relevant and diverse data is crucial for training accurate and robust models. In this notebook, we collect text data from Twitter for three languages (Darija, French, and English) by running a set of search queries for each language and saving the collected tweets to a CSV file.

### 1.  Install `snscrape` library using pip:

In [1]:
! pip install snscrape



### 2.  Import required libraries: 

In [2]:
import pandas as pd
import time
import snscrape.modules.twitter as sntwitter

### 3.  Define a function to search for tweets using specific queries:

In [3]:
def search_by_query(query):
    """Return a lazy generator over the tweets matching `query`."""
    return sntwitter.TwitterSearchScraper(query).get_items()

### 4.  Define a function to collect tweets for a given language and list of queries:

In [5]:
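def get_tweets(maxTweets, language, queries):
    """Collect up to maxTweets tweets per query, labelled with the language."""
    tweets = []
    for query in queries:
        for i, tweet in enumerate(search_by_query(query)):
            # Check the limit before appending so that at most maxTweets
            # tweets are kept per query (i is zero-based).
            if i >= maxTweets:
                break
            tweets.append({'language': language, 'content': tweet.rawContent})
    return tweets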
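
The per-query cap can also be expressed with `itertools.islice`, which stops reading from a generator after a fixed number of items. The following is a minimal sketch under stated assumptions, not the notebook's own method: `fake_search` is a hypothetical offline stand-in for `search_by_query` (it yields objects with a `rawContent` attribute so the example runs without network access), and `collect` is a hypothetical name for the variant function.

```python
from itertools import islice
from types import SimpleNamespace


def fake_search(query):
    """Hypothetical stand-in for search_by_query: an endless tweet generator."""
    n = 0
    while True:
        yield SimpleNamespace(rawContent=f"{query} tweet {n}")
        n += 1


def collect(max_tweets, language, queries, search=fake_search):
    """Keep at most max_tweets items per query, labelled with the language."""
    tweets = []
    for query in queries:
        # islice pulls only the first max_tweets items from the lazy generator.
        for tweet in islice(search(query), max_tweets):
            tweets.append({'language': language, 'content': tweet.rawContent})
    return tweets


sample = collect(3, 'english', ['technology', 'quotes'])
print(len(sample))  # 2 queries x 3 tweets each
```

Because the cap applies to each query separately, a language with two queries can contribute up to twice the cap.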

### 5.  Set the maximum number of tweets to collect per query and define the queries for each language:

In [9]:
maxTweets = 3000  # upper bound per query, not per language
queries = {'darija': ['montakhab', 'darija'],
           'french': ['amour', 'cinéma'],
           'english': ['technology', 'quotes']}

### 6.  Collect tweets for each language and append them to a list:

In [None]:
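start = time.time()  # timestamp before scraping begins
tweets = []
for language, language_queries in queries.items():
    tweets += get_tweets(maxTweets, language, language_queries)
end = time.time()  # timestamp after scraping finishes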

### 7.  Print number of collected tweets and time taken for scraping:

In [None]:
print(f"Collected {len(tweets)} tweets in {end - start:.2f} seconds.")
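
A total count does not show whether the three languages are balanced. As a rough sketch, the per-language breakdown can be computed with `collections.Counter`. Here `sample_tweets` is hypothetical placeholder data in the same shape as the `tweets` list, used so the example runs on its own.

```python
from collections import Counter

# Hypothetical sample in the same shape as the collected `tweets` list.
sample_tweets = [
    {'language': 'darija', 'content': 'example a'},
    {'language': 'french', 'content': 'example b'},
    {'language': 'french', 'content': 'example c'},
    {'language': 'english', 'content': 'example d'},
]

# Count how many tweets were collected for each language label.
per_language = Counter(t['language'] for t in sample_tweets)
for language, count in per_language.items():
    print(f"{language}: {count} tweets")
```

In the notebook itself, the same two lines would be applied to `tweets` instead of `sample_tweets`.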

### 8.  Transform the tweets into a Pandas dataframe and save it to a CSV file:

In [None]:
df = pd.DataFrame(tweets)
df.to_csv('data.csv', index=False, encoding='utf-8')
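
Tweets often contain commas, quotes, and line breaks, so it can be worth checking that the text survives the CSV round trip. The sketch below assumes `pandas` is installed and uses hypothetical sample rows and an in-memory buffer in place of `data.csv`; `sample_df` and `restored` are illustrative names.

```python
import io

import pandas as pd

# Hypothetical rows with characters that need CSV quoting.
sample_df = pd.DataFrame([
    {'language': 'english', 'content': 'first line\nsecond line, with a comma'},
    {'language': 'french', 'content': 'le "cinéma" ce soir'},
])

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The content column should be unchanged after the round trip.
print(restored['content'].tolist() == sample_df['content'].tolist())
```

The same check can be run against the real file by reading `data.csv` with `pd.read_csv` and comparing it to `df`.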