### Telegram chat analysis
#### author: Luis Durazo (github.com/ldurazo)

This project analyzes any Telegram chat and surfaces curious, fun, and sometimes meaningful information about it.

First, load the dataframe from the data folder.

In [None]:
import pandas as pd
df = pd.read_json('../data/telegram.json', dtype={'from_id': str})
df.info()

Get all unique users.

In [None]:
# List each sender id with its display name and message count
df.groupby('from_id')['from'].agg(['first', 'size'])

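Observations from the Telegram group on 2020/10/2 (IDs have been hidden and stored elsewhere). Each entry shows the sender's name and message count; for unknown senders, the trailing letter is the name worked out afterwards: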
- id: xxxxxxxxxx (Ma) - 29298
- id: xxxxxxxxxx (Unknown) - 4 - L
- id: xxxxxxxxxx (A) - 34283
- id: xxxxxxxxxx (C) - 6201
- id: xxxxxxxxxx (D) - 10191
- id: xxxxxxxxxx (R) - 17116
- id: xxxxxxxxxx (I) - 32889
- id: xxxxxxxxxx (Mi) - 18727
- id: xxxxxxxxxx (Unknown) - 5817 - W
- id: xxxxxxxxxx (L) - 15242
- id: xxxxxxxxxx (Unknown) - 3771 - X
- id: nan (NaN) - 70

With the data above, it is time to resolve the unknown senders so that we can keep only the name and drop the id.
Looking at a few sample messages made each message source easy to determine:

In [None]:
df.loc[df.from_id == 'xxxxxxxxxxxxxx', 'from'] = "L"
df.loc[df.from_id == 'xxxxxxxxxxxxxx', 'from'] = "L"
df.loc[df.from_id == 'xxxxxxxxxxxxxx', 'from'] = "W"
df.loc[df.from_id == 'xxxxxxxxxxxxxx', 'from'] = "X"
df = df[df.from_id != 'nan']  # Rows without a sender id come from Telegram service messages, likely updates
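
The same renaming step can also be written with a lookup table instead of one `df.loc` line per sender. The following is only a minimal sketch on toy data: the ids, the `known_names` dictionary, and the `toy` frame are hypothetical stand-ins, not values from the real chat.

```python
import pandas as pd

# Toy data standing in for the real chat; ids and names are made up.
toy = pd.DataFrame({
    'from_id': ['111', '222', '333', 'nan', '111'],
    'from': ['A', None, None, None, 'A'],
})

# Hypothetical lookup from a hidden id to the name worked out by hand.
known_names = {'222': 'W', '333': 'X'}

# Use the lookup where the id is known; otherwise keep the existing name.
toy['from'] = toy['from_id'].map(known_names).fillna(toy['from'])

# Drop rows without a sender id (service messages), as in the cell above.
toy = toy[toy['from_id'] != 'nan']

print(toy['from'].tolist())
```

Keeping the mapping in one dictionary makes it easier to store the real ids outside the notebook and load them at run time.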

Now remove the from_id column, as we have the names:

In [None]:
df = df.drop('from_id', axis=1)
df.info()

Let's take another look at the data and see who has sent the most messages.

In [None]:
df[['type','from']].groupby(['from']).count().sort_values(['type'], ascending=False)
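
As a rough alternative, `Series.value_counts` produces a similar ranking in one call, already sorted from most to least active. This is a sketch on a hypothetical `toy` frame; note that it counts rows per sender, while the cell above counts non-null `type` values.

```python
import pandas as pd

# Toy data; the names and message types are made up for illustration.
toy = pd.DataFrame({
    'type': ['message', 'message', 'service', 'message'],
    'from': ['A', 'I', 'A', 'A'],
})

# Number of rows per sender, sorted in descending order by default.
counts = toy['from'].value_counts()
print(counts)
```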

Next, let's chart how many messages of a given media type each person sent.

In [None]:
import plotly.express as px

# Count voice messages per sender, most active first
voice_df = (
    df.loc[df['media_type'] == 'voice_message', ['from', 'id']]
    .groupby('from', as_index=False)
    .agg('count')
    .sort_values('id', ascending=False)
)

fig = px.pie(voice_df, values='id', names='from',
             title='Voice messages per person')
fig.update_traces(textposition='inside', textinfo='value+label+percent')
fig.show()

# Count stickers per sender, most active first
sticker_df = (
    df.loc[df['media_type'] == 'sticker', ['from', 'id']]
    .groupby('from', as_index=False)
    .agg('count')
    .sort_values('id', ascending=False)
)

fig = px.pie(sticker_df, values='id', names='from',
             title='Stickers sent')
fig.update_traces(textposition='inside', textinfo='value+label+percent')
fig.show()
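
The voice and sticker cells share the same counting logic, so it could be pulled into a small helper. The sketch below is one possible shape, not part of the original notebook: `count_by_media_type` and the `toy` frame are hypothetical names, and the column names `from`, `id`, and `media_type` are assumed to match the dataframe used above.

```python
import pandas as pd

def count_by_media_type(frame, media_type):
    """Count messages of one media type per sender, most active first."""
    subset = frame.loc[frame['media_type'] == media_type, ['from', 'id']]
    return (
        subset.groupby('from', as_index=False)
        .agg('count')
        .sort_values('id', ascending=False)
    )

# Toy data standing in for the real chat.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'from': ['A', 'I', 'A', 'I', 'A'],
    'media_type': ['sticker', 'voice_message', 'sticker', 'sticker', None],
})

sticker_counts = count_by_media_type(toy, 'sticker')
voice_counts = count_by_media_type(toy, 'voice_message')
print(sticker_counts)
```

The returned frame has the same `from` and `id` columns as `voice_df` and `sticker_df`, so it can be passed to `px.pie` in the same way.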
