In this file you can write whatever you see fit. We recommend describing your solution in detail, along with every assumption you are making. Here you can run the functions you defined in the other files of the src folder and measure time, memory, etc.

In [6]:
file_path = "farmers-protest-tweets-2021-2-4.json"

# 1. Initializing PySpark

## Here we use PySpark as the framework for processing the tweets JSON file

In [1]:
from pyspark.sql import SparkSession
from pyspark import SparkContext

In [2]:
import findspark
findspark.init()

In [3]:
spark = SparkSession.builder.appName('latam_challenge').getOrCreate()

In [4]:
spark

# 2. Reading Tweets JSON File

## The goal here is to check the file's structure and content using a DataFrame preview

In [7]:
df = spark.read.json(file_path)

In [8]:
df.printSchema()

root
 |-- content: string (nullable = true)
 |-- conversationId: long (nullable = true)
 |-- date: string (nullable = true)
 |-- id: long (nullable = true)
 |-- lang: string (nullable = true)
 |-- likeCount: long (nullable = true)
 |-- media: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- duration: double (nullable = true)
 |    |    |-- fullUrl: string (nullable = true)
 |    |    |-- previewUrl: string (nullable = true)
 |    |    |-- thumbnailUrl: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- variants: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- bitrate: long (nullable = true)
 |    |    |    |    |-- contentType: string (nullable = true)
 |    |    |    |    |-- url: string (nullable = true)
 |-- mentionedUsers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- created: string (nullable = true)
 |    |   

In [16]:
df.show(5)

+--------------------+-------------------+--------------------+-------------------+----+---------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+----------+------------+--------------+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+
|             content|     conversationId|                date|                 id|lang|likeCount|               media|      mentionedUsers|            outlinks|quoteCount|         quotedTweet|     renderedContent|replyCount|retweetCount|retweetedTweet|              source|        sourceLabel|           sourceUrl|         tcooutlinks|                 url|                user|
+--------------------+-------------------+--------------------+-------------------+----+---------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+----------+------------+----
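
As a lightweight complement to the DataFrame preview, the file's top-level structure can also be inspected without starting Spark. The sketch below assumes the file is newline-delimited JSON (one tweet per line), which is the layout `spark.read.json` reads by default. `peek_keys` is a hypothetical helper name, not part of the challenge code.

```python
import json
from itertools import islice
from typing import Dict, List


def peek_keys(path: str, n_records: int = 5) -> Dict[str, List[str]]:
    """Map each top-level key to the Python type names seen in the first n_records lines."""
    seen: Dict[str, set] = {}
    with open(path, encoding="utf-8") as handle:
        for line in islice(handle, n_records):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            for key, value in record.items():
                seen.setdefault(key, set()).add(type(value).__name__)
    # Sort keys and type names so the result is stable across runs
    return {key: sorted(types) for key, types in sorted(seen.items())}
```

Calling `peek_keys(file_path)` would return a small dictionary that can be compared against the schema Spark inferred; a key with several type names (for example `NoneType` and `list`) points at a nullable field.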

# 3. Solving problems

## 3.0 Importing output types and profiling libraries

In [40]:
# Output data type
from typing import List, Tuple
from datetime import datetime

# Transformation
from pyspark.sql.functions import to_date, col

# Time performance
from cProfile import Profile
from pstats import SortKey, Stats
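
The profiling imports above can be wrapped in a small reusable helper so that every question is measured the same way. This is a minimal sketch, not the notebook's own method: `profile_call` is a hypothetical name, and sorting by cumulative time is an assumption about what is most useful to read. Note that `cProfile` only sees the Python driver process; work done inside Spark's JVM shows up as time spent waiting on the Python-to-JVM bridge.

```python
import io
from cProfile import Profile
from pstats import SortKey, Stats
from typing import Any, Callable, Tuple


def profile_call(func: Callable[..., Any], *args: Any, top: int = 10, **kwargs: Any) -> Tuple[Any, str]:
    """Run func under cProfile and return (result, text report of the `top` entries)."""
    profiler = Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    # Write the report to a string instead of stdout so it can be stored or compared
    stats = Stats(profiler, stream=buffer)
    stats.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(top)
    return result, buffer.getvalue()
```

A call such as `result, report = profile_call(q1_time, file_path)` followed by `print(report)` would give one report per question.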

## 3.1 Time -> The top 10 dates with the most tweets. Report the user (username) with the most posts on each of those days

In [19]:
q1_main_df = df.select("date","user.username")

In [45]:
def q1_time(file_path: str) -> None:
    # Work in progress: prints tweet counts per (date, user).
    # Target return type: List[Tuple[datetime.date, str]]
    tweet_list_df = spark.read.json(file_path)

    # Keep only the tweet date (parsed from the timestamp string) and the username
    tweet_list_df = tweet_list_df.select(
        to_date("date").alias("tweet_dt"),
        col("user.username").alias("user_name"),
    )

    # Count tweets per (date, user) pair
    tweet_grp_df = tweet_list_df.groupBy("tweet_dt", "user_name").count()

    # show() prints the table itself and returns None, so it is not wrapped in print()
    tweet_grp_df.show(truncate=False)

In [46]:
with Profile() as profile:
    q1_time(file_path)
    (
     Stats(profile)
     .strip_dirs()
     .sort_stats(SortKey.CALLS)
     .print_stats()
    )

+----------+---------------+-----+
|tweet_dt  |user_name      |count|
+----------+---------------+-----+
|2021-02-24|BumblebeeUmeed |1    |
|2021-02-24|htTweets       |2    |
|2021-02-24|v_sanjai       |1    |
|2021-02-24|dr_sonia27     |1    |
|2021-02-24|RamneetMann4   |1    |
|2021-02-24|JPSinghRuhil   |2    |
|2021-02-24|BjpReporting   |1    |
|2021-02-24|manesh67726670 |1    |
|2021-02-24|AKulvir        |1    |
|2021-02-24|HAchahal       |1    |
|2021-02-24|AnuragVerma_SP |1    |
|2021-02-24|Gurwind33930102|3    |
|2021-02-23|TechieKisaan   |2    |
|2021-02-23|MindsetMatter1 |1    |
|2021-02-23|Nimratkhaira_  |1    |
|2021-02-23|Lats_tweets    |1    |
|2021-02-23|Deshbha99450233|1    |
|2021-02-23|SecondEye5     |1    |
|2021-02-23|alieshan09     |1    |
|2021-02-23|SikhWhite      |1    |
+----------+---------------+-----+
only showing top 20 rows

         3606 function calls (3589 primitive calls) in 3.894 seconds

   Ordered by: call count

   ncalls  tottime  percall  cum
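
The grouped counts above are an intermediate step: the question still requires picking the 10 busiest dates and the most active user on each. One way to validate a Spark answer on a small sample is a plain-Python reference implementation, sketched below. It is not the notebook's method. `q1_reference` is a hypothetical name, and the code assumes a newline-delimited file whose `date` field is an ISO 8601 string starting with `YYYY-MM-DD`. It takes the calendar date as written in the string, which may differ from Spark's `to_date` if the session time zone shifts a timestamp across midnight.

```python
import json
from collections import Counter, defaultdict
from datetime import date
from typing import Dict, List, Tuple


def q1_reference(path: str, top_n: int = 10) -> List[Tuple[date, str]]:
    """Return (date, most active username) for the top_n dates with the most tweets."""
    tweets_per_day: Counter = Counter()
    users_per_day: Dict[date, Counter] = defaultdict(Counter)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            tweet = json.loads(line)
            # Assumption: the first 10 characters of "date" are YYYY-MM-DD
            day = date.fromisoformat(tweet["date"][:10])
            tweets_per_day[day] += 1
            users_per_day[day][tweet["user"]["username"]] += 1
    # For each of the busiest dates, pick the user with the highest tweet count
    return [
        (day, users_per_day[day].most_common(1)[0][0])
        for day, _ in tweets_per_day.most_common(top_n)
    ]
```

This version holds only per-day counters in memory, so it is practical for a sample of the file; ties are resolved by first appearance, which a Spark implementation would need to match explicitly before the two results can be compared.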

## 3.2 Memory -> The top 10 dates with the most tweets. Report the user (username) with the most posts on each of those days
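
For the memory-oriented variant, a measurement helper is useful alongside the time profiler. The sketch below uses the standard-library `tracemalloc` module; `peak_python_memory` is a hypothetical name. An important caveat: `tracemalloc` only tracks allocations made by the Python interpreter in the driver process, so memory used by Spark's JVM is not included and would need to be observed separately (for example through the Spark UI).

```python
import tracemalloc
from typing import Any, Callable, Tuple


def peak_python_memory(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, int]:
    """Run func and return (result, peak bytes allocated by Python during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes since start()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```

A call such as `_, peak = peak_python_memory(q1_time, file_path)` followed by `print(f"{peak / 1e6:.1f} MB")` would report the driver-side Python peak only.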