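# Loading the Profilers
The profilers are loaded, and a cell magic is defined that writes each function definition to a file and runs it in the same step. This makes the functions easier to measure with the profilers loaded in memory, since `%mprun` can only profile functions that are defined in a file on disk.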

In [130]:
%load_ext memory_profiler

The memory_profiler extension is already loaded. To reload it, use:
  %reload_ext memory_profiler


In [131]:
%load_ext line_profiler

The line_profiler extension is already loaded. To reload it, use:
  %reload_ext line_profiler


In [132]:
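from IPython.core.magic import register_cell_magic


@register_cell_magic
def write_and_run(line, cell):
    """Write the cell body to a file, then execute it in the notebook.

    Usage: %%write_and_run [-a] file_name  (-a appends instead of overwriting)
    """
    argz = line.split()
    file = argz[-1]  # the last token is always the target file
    mode = 'w'
    if len(argz) == 2 and argz[0] == '-a':
        mode = 'a'
    with open(file, mode) as f:
        f.write(cell)
    get_ipython().run_cell(cell)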
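
The argument handling of `write_and_run` can be checked without IPython. The following is a minimal sketch, assuming a hypothetical helper `parse_write_args` (not part of the notebook) that mirrors the same rule: the last token is the file name, and a leading `-a` switches to append mode. Unlike the magic, it raises a clear error when no file name is given.

```python
def parse_write_args(line):
    """Return (file_name, mode) for a magic line such as '-a out.py'.

    Hypothetical helper for illustration only; it mirrors the parsing rule
    of write_and_run but is not part of the original notebook.
    """
    args = line.split()
    if not args:
        raise ValueError("expected a file name, e.g. '%%write_and_run out.py'")
    # Append only when the line is exactly '-a <file>'; otherwise overwrite.
    mode = "a" if len(args) == 2 and args[0] == "-a" else "w"
    return args[-1], mode


print(parse_write_args("test_func.py"))     # ('test_func.py', 'w')
print(parse_write_args("-a test_func.py"))  # ('test_func.py', 'a')
```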

In [133]:
%%write_and_run test_func.py
def testing_func():
    x=[]
    for i in range (2):
        x+=[i]

In [134]:
 
%lprun -f testing_func testing_func()
from test_func import testing_func
%mprun -f testing_func testing_func()




Filename: c:\Users\Juantc93\DEV\latam_test\test_func.py

Line #    Mem usage    Increment  Occurrences   Line Contents
     1    733.9 MiB    733.9 MiB           1   def testing_func():
     2    733.9 MiB      0.0 MiB           1       x=[]
     3    733.9 MiB      0.0 MiB           3       for i in range (2):
     4    733.9 MiB      0.0 MiB           2           x+=[i]
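
In the `%mprun` table, `Mem usage` is the memory of the whole kernel process, so the 733.9 MiB on the first line is the baseline already held by the session, and the `Increment` column is the figure attributable to each line. As a rough cross-check outside IPython, the standard-library `tracemalloc` module can report the peak of Python-level allocations made during a call. This is a different tool from `memory_profiler` and measures something narrower, so the numbers are not directly comparable. `peak_mib` is a hypothetical helper name, and `testing_func` returns its list here only so the result can be inspected.

```python
import tracemalloc


def peak_mib(func, *args, **kwargs):
    """Run func and return (result, peak Python allocation in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20


def testing_func():
    x = []
    for i in range(2):
        x += [i]
    return x  # returned only for this illustration


result, peak = peak_mib(testing_func)
print(result, f"{peak:.4f} MiB")
```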

## Constants

In [72]:
FILE_PATH="tweets.json.zip"

Overall, the optimization strategy consisted of using lower-level libraries to read the serialized data and keeping only the fields that are strictly necessary, in order to reduce memory usage.
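
The idea of reading only the needed field can be sketched with the standard library alone. The example below is an illustration under assumptions, not the notebook's implementation: it uses `json` in place of `ujson`, the helper name `iter_field` is hypothetical, and the archive is assumed to hold a single newline-delimited JSON file as its first member.

```python
import json
import zipfile


def iter_field(zip_path, field):
    """Yield one field per record from the first file inside a zip archive.

    Assumes newline-delimited JSON. Records are decoded one at a time, so
    only the current line is held in memory by this generator.
    """
    with zipfile.ZipFile(zip_path, "r") as archive:
        first_member = archive.infolist()[0]
        with archive.open(first_member) as lines:
            for raw_line in lines:
                if raw_line.strip():  # skip blank lines
                    yield json.loads(raw_line).get(field)
```

A call such as `iter_field(FILE_PATH, "content")` would then produce the tweet texts lazily, without building a DataFrame.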

In general, the base case built on pandas produced slightly shorter execution times than the ujson-based approach.
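
Because `%lprun` adds per-line instrumentation overhead, small differences like these are worth confirming with a plain wall-clock measurement. The following is a minimal sketch, assuming a hypothetical helper `best_of` and placeholder workloads; it does not reproduce the notebook's timings.

```python
import time


def best_of(func, *args, repeats=3):
    """Return the shortest wall-clock time in seconds over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Placeholder workloads standing in for two variants of the same task.
def variant_a(n):
    return sum(range(n))


def variant_b(n):
    return sum([i for i in range(n)])


for variant in (variant_a, variant_b):
    print(variant.__name__, f"{best_of(variant, 100_000):.6f} s")
```

Taking the minimum over a few repeats reduces the influence of unrelated system activity on the comparison.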

Unfortunately, I could not achieve a memory optimization for task 1; nevertheless, I present the base case.
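
One direction that might lower peak memory for task 1 is to skip the DataFrame and count while streaming the file. The sketch below was not run against the real dataset or profiled here, so any saving is an expectation rather than a result. Assumptions: each record has a `date` string whose first ten characters are the ISO day (`YYYY-MM-DD`) and a `user` object with a `username` key; `q1_stream_sketch` is a hypothetical name; `json` stands in for `ujson`; ties are broken by first occurrence, which may differ from the ranking in `q1_base`.

```python
import json
import zipfile
from collections import Counter, defaultdict
from datetime import date


def q1_stream_sketch(zip_path):
    """Top 10 days by tweet count, each paired with its most active user."""
    tweets_per_day = Counter()
    users_per_day = defaultdict(Counter)

    with zipfile.ZipFile(zip_path, "r") as archive:
        with archive.open(archive.infolist()[0]) as lines:
            for raw_line in lines:
                if not raw_line.strip():
                    continue
                record = json.loads(raw_line)
                day = record["date"][:10]  # assumed ISO prefix YYYY-MM-DD
                username = (record.get("user") or {}).get("username")
                tweets_per_day[day] += 1
                users_per_day[day][username] += 1

    # Only the per-day counters are kept, never the full records.
    return [
        (date.fromisoformat(day), users_per_day[day].most_common(1)[0][0])
        for day, _ in tweets_per_day.most_common(10)
    ]
```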

## Tweeters of the Day

In [135]:
%%write_and_run q1_base.py
def q1_base(file_path):
    import pandas as pd
    df = pd.read_json(file_path, lines=True)\
        .assign(date=lambda x: x.date.dt.date,
                username=lambda x: x.user.apply(lambda y: y.get("username")))

    df_top_ten_date=df["date"]\
        .value_counts()\
        .nlargest(10)\
        .rename("twits")\
        .rename_axis("date")


    return df.loc[:,["date","username"]]\
        .set_index("date")\
        .join(df_top_ten_date,how="inner")\
        .set_index("username",append=True)\
        .groupby(["date","username","twits"])\
        .agg(twits_user=pd.NamedAgg(column="twits",aggfunc="count"))\
        .assign(
            rank=lambda x: x.groupby(["date","twits"])["twits_user"].rank(axis='index',method="first",ascending=False))\
        .query("rank==1")\
        .sort_index(level="twits",ascending=False)\
        .reset_index()\
        .loc[:,["date","username"]]\
        .to_records(index=False).tolist()



In [136]:
q1_base(FILE_PATH)

[(datetime.date(2021, 2, 12), 'RanbirS00614606'),
 (datetime.date(2021, 2, 13), 'MaanDee08215437'),
 (datetime.date(2021, 2, 17), 'RaaJVinderkaur'),
 (datetime.date(2021, 2, 16), 'jot__b'),
 (datetime.date(2021, 2, 14), 'rebelpacifist'),
 (datetime.date(2021, 2, 18), 'neetuanjle_nitu'),
 (datetime.date(2021, 2, 15), 'jot__b'),
 (datetime.date(2021, 2, 20), 'MangalJ23056160'),
 (datetime.date(2021, 2, 23), 'Surrypuria'),
 (datetime.date(2021, 2, 19), 'Preetm91')]

In [137]:
%lprun -f q1_base q1_base(FILE_PATH)

Timer unit: 1e-07 s

Total time: 8.76166 s
File: C:\Users\Juantc93\AppData\Local\Temp\ipykernel_8900\1763400204.py
Function: q1_base at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def q1_base(file_path):
     2         1         20.0     20.0      0.0      import pandas as pd
     3         2   83789360.0    4e+07     95.6      df = pd.read_json(file_path, lines=True)\
     4         1         33.0     33.0      0.0          .assign(date=lambda x: x.date.dt.date,
     5         1         11.0     11.0      0.0                  username=lambda x: x.user.apply(lambda y: y.get("username")))
     6                                           
     7         4     100920.0  25230.0      0.1      df_top_ten_date=df["date"]\
     8                                                   .value_counts()\
     9         1          3.0      3.0      0.0          .nlargest(10)\
    10         1          4.0      4.0      0.0     

In [138]:

from q1_base import q1_base

print("\n\n --- MEMORY_PROFILING ---\n\n")
%mprun -f q1_base q1_base(FILE_PATH)

 --- MEMORY_PROFILING ---



Filename: c:\Users\Juantc93\DEV\latam_test\q1_base.py

Line #    Mem usage    Increment  Occurrences   Line Contents
     1    755.9 MiB    755.9 MiB           1   def q1_base(file_path):
     2    755.9 MiB      0.0 MiB           1       import pandas as pd
     3   2018.1 MiB   1241.7 MiB           2       df = pd.read_json(file_path, lines=True)\
     4   2035.2 MiB     20.6 MiB           3           .assign(date=lambda x: x.date.dt.date,
     5   2035.2 MiB      0.0 MiB      234817                   username=lambda x: x.user.apply(lambda y: y.get("username")))
     6                                         
     7   2018.1 MiB      0.0 MiB           4       df_top_ten_date=df["date"]\
     8                                                 .value_counts()\
     9   2018.1 MiB      0.0 MiB           1           .nlargest(10)\
    10   2018.1 MiB      0.0 MiB           1           .rename("twits")\
    11   2018.1 MiB      0.0 MiB           1           .rename_axis("date")
    12     

## Top 10 Emojis

In [139]:
%%write_and_run q2_time_opt.py

def q2_time_opt(file_path):
    
    import pandas as pd
    import emoji
    from collections import Counter
    import re
    
    df = pd.read_json(file_path, lines=True)

    return Counter(
        (
            (
                i.chars for i in emoji.analyze(("|").join(list(df.content.values)))
                )
            )
            )\
            .most_common(10)



In [141]:
q2_time_opt(FILE_PATH)

[('🙏', 5049),
 ('😂', 3072),
 ('🚜', 2972),
 ('🌾', 2182),
 ('🇮🇳', 2086),
 ('🤣', 1668),
 ('✊', 1651),
 ('❤️', 1382),
 ('🙏🏻', 1317),
 ('💚', 1040)]

In [142]:
%lprun -f q2_time_opt q2_time_opt(FILE_PATH)

Timer unit: 1e-07 s

Total time: 125.24 s
File: C:\Users\Juantc93\AppData\Local\Temp\ipykernel_8900\764165242.py
Function: q2_time_opt at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def q2_time_opt(file_path):
     2                                               
     3         1         24.0     24.0      0.0      import pandas as pd
     4         1         18.0     18.0      0.0      import emoji
     5         1         88.0     88.0      0.0      from collections import Counter
     6         1         15.0     15.0      0.0      import re
     7                                               
     8         1   86978153.0    9e+07      6.9      df = pd.read_json(file_path, lines=True)\
     9                                           
    10         3 1165017399.0    4e+08     93.0      return Counter(
    11                                                   (
    12         2         52.0     26.0      0.

In [143]:
from q2_time_opt import q2_time_opt

print("\n\n --- MEMORY_PROFILING ---\n\n")
%mprun -f q2_time_opt q2_time_opt(FILE_PATH)

 --- MEMORY_PROFILING ---



Filename: c:\Users\Juantc93\DEV\latam_test\q2_time_opt.py

Line #    Mem usage    Increment  Occurrences   Line Contents
     2    821.9 MiB    821.9 MiB           1   def q2_time_opt(file_path):
     3                                             
     4    821.9 MiB      0.0 MiB           1       import pandas as pd
     5    821.9 MiB      0.0 MiB           1       import emoji
     6    821.9 MiB      0.0 MiB           1       from collections import Counter
     7    821.9 MiB      0.0 MiB           1       import re
     8                                             
     9   2021.3 MiB   1199.4 MiB           1       df = pd.read_json(file_path, lines=True)\
    10                                         
    11   2021.3 MiB    -65.4 MiB           3       return Counter(
    12                                                 (
    13   2086.7 MiB    -65.4 MiB       85846               (
    14   2086.7 MiB     65.4 MiB           1                   i.chars for i in emoji.analyze((

In [144]:
%%write_and_run q2_mem_opt.py

def q2_mem_opt(file_path):

    import ujson
    import emoji
    from collections import Counter
    import re
    import zipfile


    with zipfile.ZipFile(file_path, 'r') as f:
        with f.open(f.filelist[0]) as g:
            content_list=[ujson.loads(line).get("content") for line in g]

    return Counter(
        (
            (
                i.chars for i in emoji.analyze(("|").join(content_list))
                )
            )
            )\
            .most_common(10)


In [145]:
q2_mem_opt(FILE_PATH)

[('🙏', 5049),
 ('😂', 3072),
 ('🚜', 2972),
 ('🌾', 2182),
 ('🇮🇳', 2086),
 ('🤣', 1668),
 ('✊', 1651),
 ('❤️', 1382),
 ('🙏🏻', 1317),
 ('💚', 1040)]

In [146]:
%lprun -f q2_mem_opt q2_mem_opt(FILE_PATH)

Timer unit: 1e-07 s

Total time: 128.81 s
File: C:\Users\Juantc93\AppData\Local\Temp\ipykernel_8900\4014220813.py
Function: q2_mem_opt at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def q2_mem_opt(file_path):
     2                                           
     3         1         22.0     22.0      0.0      import ujson
     4         1         16.0     16.0      0.0      import emoji
     5         1         66.0     66.0      0.0      from collections import Counter
     6         1         10.0     10.0      0.0      import re
     7         1         12.0     12.0      0.0      import zipfile
     8                                           
     9                                           
    10         1       8215.0   8215.0      0.0      with zipfile.ZipFile(file_path, 'r') as f:
    11         1       4108.0   4108.0      0.0          with f.open(f.filelist[0]) as g:
    12         1  101674863.0  

In [105]:
from q2_mem_opt import q2_mem_opt

print("\n\n --- MEMORY_PROFILING ---\n\n")
%mprun -f q2_mem_opt q2_mem_opt(FILE_PATH)

 --- MEMORY_PROFILING ---



Filename: c:\Users\Juantc93\DEV\latam_test\q2_mem_opt.py

Line #    Mem usage    Increment  Occurrences   Line Contents
     2    698.4 MiB    698.4 MiB           1   def q2_mem_opt(file_path):
     3                                         
     4    698.4 MiB      0.0 MiB           1       import ujson
     5    698.4 MiB      0.0 MiB           1       import emoji
     6    698.4 MiB      0.0 MiB           1       from collections import Counter
     7    698.4 MiB      0.0 MiB           1       import re
     8    698.4 MiB      0.0 MiB           1       import zipfile
     9                                         
    10                                         
    11    698.4 MiB      0.0 MiB           1       with zipfile.ZipFile(file_path, 'r') as f:
    12    698.4 MiB      0.0 MiB           1           with f.open(f.filelist[0]) as g:
    13    698.5 MiB      0.1 MiB      117410               content_list=[ujson.loads(line).get("content") for line in g]
    14               

## Top 10 Influencers

In [90]:
%%write_and_run q3_time_opt.py
def q3_time_opt(file_path):

    import pandas as pd
    from collections import Counter
    import re

    df = pd.read_json(file_path, lines=True)
    mention_pattern=re.compile(r'@(\w+)')  # raw string avoids the invalid "\w" escape warning
    return Counter([i.upper() for i in re.findall(mention_pattern,("|").join(list(df.content.values)))]).most_common(10)

In [None]:
q3_time_opt(FILE_PATH)

In [91]:
%lprun -f q3_time_opt q3_time_opt(FILE_PATH)

Timer unit: 1e-07 s

Total time: 8.77143 s
File: C:\Users\Juantc93\AppData\Local\Temp\ipykernel_8900\2262263326.py
Function: q3_time_opt at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def q3_time_opt(file_path):
     2                                           
     3         1         22.0     22.0      0.0      import pandas as pd
     4         1         87.0     87.0      0.0      from collections import Counter
     5         1         15.0     15.0      0.0      import re
     6                                           
     7         1   85937001.0    9e+07     98.0      df = pd.read_json(file_path, lines=True)
     8         1        157.0    157.0      0.0      mention_pattern=re.compile('@(\w+)')
     9         1    1777058.0    2e+06      2.0      return Counter([i.upper() for i in re.findall(mention_pattern,("|").join(list(df.content.values)))]).most_common(10)

In [99]:
from q3_time_opt import q3_time_opt

print("\n\n --- MEMORY_PROFILING ---\n\n")
%mprun -f q3_time_opt q3_time_opt(FILE_PATH)

 --- MEMORY_PROFILING ---



Filename: c:\Users\Juantc93\DEV\latam_test\q3_time_opt.py

Line #    Mem usage    Increment  Occurrences   Line Contents
     1    677.6 MiB    677.6 MiB           1   def q3_time_opt(file_path):
     2                                         
     3    677.6 MiB      0.0 MiB           1       import pandas as pd
     4    677.6 MiB      0.0 MiB           1       from collections import Counter
     5    677.6 MiB      0.0 MiB           1       import re
     6                                         
     7   2000.7 MiB   1323.0 MiB           1       df = pd.read_json(file_path, lines=True)
     8   2000.7 MiB      0.0 MiB           1       mention_pattern=re.compile('@(\w+)')
     9   2001.1 MiB      0.4 MiB      104069       return Counter([i.upper() for i in re.findall(mention_pattern,("|").join(list(df.content.values)))]).most_common(10)

In [87]:
%%write_and_run q3_mem_opt.py

def q3_mem_opt(file_path):

    import ujson
    from collections import Counter
    import re
    import zipfile


    with zipfile.ZipFile(file_path, 'r') as f:
        with f.open(f.filelist[0]) as g:
            content_list=[ujson.loads(line).get("content") for line in g]
        


    mention_pattern=re.compile(r'@(\w+)')  # raw string avoids the invalid "\w" escape warning
    return Counter([i.upper() for i in re.findall(mention_pattern,(' ').join(content_list))]).most_common(10)
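
Both variants join every tweet into one large string before matching. A possible refinement, sketched here under assumptions and not profiled, is to update the `Counter` record by record so that neither the list of contents nor the joined string has to exist. `q3_stream_sketch` is a hypothetical name, and `json` stands in for `ujson`.

```python
import json
import re
import zipfile
from collections import Counter

MENTION_PATTERN = re.compile(r"@(\w+)")


def q3_stream_sketch(zip_path):
    """Count @mentions, ignoring case, one record at a time."""
    mentions = Counter()
    with zipfile.ZipFile(zip_path, "r") as archive:
        with archive.open(archive.infolist()[0]) as lines:
            for raw_line in lines:
                if not raw_line.strip():
                    continue
                # A missing or null "content" is treated as an empty string.
                content = json.loads(raw_line).get("content") or ""
                mentions.update(
                    name.upper() for name in MENTION_PATTERN.findall(content)
                )
    return mentions.most_common(10)
```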

In [None]:
q3_mem_opt(FILE_PATH)

In [88]:
%lprun -f q3_mem_opt q3_mem_opt(FILE_PATH)

Timer unit: 1e-07 s

Total time: 9.85909 s
File: C:\Users\Juantc93\AppData\Local\Temp\ipykernel_8900\3792736949.py
Function: q3_mem_opt at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def q3_mem_opt(file_path):
     2                                           
     3         1         28.0     28.0      0.0      import ujson
     4         1        103.0    103.0      0.0      from collections import Counter
     5         1         18.0     18.0      0.0      import re
     6         1         16.0     16.0      0.0      import zipfile
     7                                           
     8                                           
     9         1      10809.0  10809.0      0.0      with zipfile.ZipFile(file_path, 'r') as f:
    10         1        899.0    899.0      0.0          with f.open(f.filelist[0]) as g:
    11         1   97246730.0    1e+08     98.6              content_list=[ujson.loads(line).get

In [89]:
from q3_mem_opt import q3_mem_opt

print("\n\n --- MEMORY_PROFILING ---\n\n")
%mprun -f q3_mem_opt q3_mem_opt(FILE_PATH)

 --- MEMORY_PROFILING ---



Filename: c:\Users\Juantc93\DEV\latam_test\q3_mem_opt.py

Line #    Mem usage    Increment  Occurrences   Line Contents
     2    232.1 MiB    232.1 MiB           1   def q3_mem_opt(file_path):
     3                                         
     4    232.1 MiB      0.0 MiB           1       import ujson
     5    232.1 MiB      0.0 MiB           1       from collections import Counter
     6    232.1 MiB      0.0 MiB           1       import re
     7    232.1 MiB      0.0 MiB           1       import zipfile
     8                                         
     9                                         
    10    232.1 MiB      0.0 MiB           1       with zipfile.ZipFile(file_path, 'r') as f:
    11    232.1 MiB      0.0 MiB           1           with f.open(f.filelist[0]) as g:
    12    232.1 MiB      0.0 MiB      117410               content_list=[ujson.loads(line).get("content") for line in g]
    13                                                 
    14                       