# Eye of Emergency 🚨 - Disaster Tweet Analysis

## Project objective
Develop a machine learning model that classifies tweets reporting real natural disasters, to help emergency responders and the public access accurate, reliable information during a crisis.



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('../data/raw/train_tweets.csv', index_col='id')


In [3]:
df.shape

(7613, 4)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7613 entries, 1 to 10873
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   keyword   7552 non-null   object
 1   location  5080 non-null   object
 2   text      7613 non-null   object
 3   target    7613 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 297.4+ KB


In [5]:
df.head()

| id | keyword | location | text | target |
|---|---|---|---|---|
| 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [6]:
df.describe(include='all')

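| | keyword | location | text | target |
|---|---|---|---|---|
| count | 7552 | 5080 | 7613 | 7613.0 |
| unique | 221 | 3341 | 7503 | NaN |
| top | fatalities | USA | 11-Year-Old Boy Charged With Manslaughter of T... | NaN |
| freq | 45 | 104 | 10 | NaN |
| mean | NaN | NaN | NaN | 0.42966 |
| std | NaN | NaN | NaN | 0.49506 |
| min | NaN | NaN | NaN | 0.0 |
| 25% | NaN | NaN | NaN | 0.0 |
| 50% | NaN | NaN | NaN | 0.0 |
| 75% | NaN | NaN | NaN | 1.0 |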


### Note
- The ids are not contiguous: they range from 1 to 10873<br>
while there are only 7613 rows<br>
- `keyword` and especially `location` have missing values (61 and 2533 respectively)

In [7]:
df.isna().sum()

keyword       61
location    2533
text           0
target         0
dtype: int64
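
To put these counts in proportion, the share of missing values per column can be computed with `isna().mean()`. The sketch below runs on a small stand-in DataFrame; the `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's DataFrame: same column names, made-up rows.
toy = pd.DataFrame(
    {
        "keyword": ["fire", None, "flood", "storm"],
        "location": [None, None, "Paris", None],
        "text": ["a", "b", "c", "d"],
        "target": [1, 0, 1, 0],
    }
)

# isna() gives a boolean mask; the mean of a boolean column is the fraction of True values.
missing_pct = toy.isna().mean() * 100
print(missing_pct.round(1))
```

Applied to the counts above, 61 / 7613 is roughly 0.8% of `keyword` and 2533 / 7613 is roughly 33.3% of `location`.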

### `target`
The dataset is fairly balanced:
57% of rows are 0 (not a real disaster) and 43% are 1 (real disaster)

In [8]:
target_distrib = df['target'].value_counts()
# Compute the percentage of each class
target_normalized = df['target'].value_counts(normalize=True) * 100
print(f"Class distribution: {target_distrib}\n\n\
      Percentage of each class:\n{target_normalized}")

Class distribution: target
0    4342
1    3271
Name: count, dtype: int64

      Percentage of each class:
target
0    57.034021
1    42.965979
Name: proportion, dtype: float64
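
One practical consequence of this balance: a classifier that always predicts the majority class sets the accuracy floor that any model has to beat. A small sketch using the counts printed above (the `class_counts` name is only for illustration):

```python
# Class counts as printed by value_counts() above.
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy obtained by always predicting the most frequent class.
baseline_accuracy = class_counts[majority_class] / total
print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```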


### `keyword`
These values will need to be normalized: spaces are percent-encoded, as in `forest%20fire`

In [9]:
print(f"{df['keyword'].value_counts()}\n\n\
    Number of keywords: {df['keyword'].value_counts().sum()}\n\n\
    Number of unique keywords: {df['keyword'].nunique()}")

keyword
fatalities               45
deluge                   42
armageddon               42
sinking                  41
damage                   41
                         ..
forest%20fire            19
epicentre                12
threat                   11
inundation               10
radiation%20emergency     9
Name: count, Length: 221, dtype: int64

    Number of keywords: 7552

    Number of unique keywords: 221
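
One possible normalization for these keywords is to decode the percent-encoding and unify case and surrounding whitespace. The `normalize_keyword` helper below is a hypothetical sketch in plain Python, not the notebook's own method.

```python
from typing import Optional
from urllib.parse import unquote


def normalize_keyword(keyword: Optional[str]) -> Optional[str]:
    """Decode percent-encoding (e.g. %20 -> space), trim and lowercase a keyword."""
    if keyword is None:
        return None  # keep missing keywords missing
    return unquote(keyword).strip().lower()


print(normalize_keyword("forest%20fire"))
print(normalize_keyword("radiation%20emergency"))
```

pandas stores missing keywords as `NaN` rather than `None`, so on the notebook's DataFrame something like `df['keyword'].map(normalize_keyword, na_action='ignore')` would be needed to skip them.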


### `location`
These values will need to be normalized: the same place appears under several spellings (`USA` and `United States`, for example)

In [10]:
print(f"{df['location'].value_counts()}\n\n\
    Number of locations: {df['location'].value_counts().sum()}\n\n\
    Number of unique locations: {df['location'].nunique()}")

location
USA                    104
New York                71
United States           50
London                  45
Canada                  29
                      ... 
MontrÌ©al, QuÌ©bec       1
Montreal                 1
ÌÏT: 6.4682,3.18287      1
Live4Heed??              1
Lincoln                  1
Name: count, Length: 3341, dtype: int64

    Number of locations: 5080

    Number of unique locations: 3341
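
A first step toward normalizing locations could be a lookup table that maps known spellings to a single canonical name. The sketch below is only an illustration: the `LOCATION_ALIASES` table and the `normalize_location` helper are hypothetical, and the alias entries are examples rather than a list derived from the data.

```python
from typing import Optional

# Hypothetical alias table: the entries are illustrative, not derived from the data.
LOCATION_ALIASES = {
    "usa": "united states",
    "us": "united states",
    "united states": "united states",
}


def normalize_location(location: Optional[str]) -> Optional[str]:
    """Lowercase, collapse whitespace and map known aliases to one canonical name."""
    if location is None:
        return None  # keep missing locations missing
    cleaned = " ".join(location.split()).lower()
    return LOCATION_ALIASES.get(cleaned, cleaned)


for raw in ["USA", "United States", " US ", "London"]:
    print(repr(raw), "->", normalize_location(raw))
```

A lookup table will not resolve free-text values such as `Live4Heed??` or raw coordinates; those would need a separate rule, or could simply be treated as unknown.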


### `text`

Check the type of each observation in `text`<br>
All of them are `str`

In [11]:
print(df['text'].dtype)
print(df['text'].apply(type).value_counts())


object
text
<class 'str'>    7613
Name: count, dtype: int64


In [12]:
print(f"Number of texts:        {df['text'].value_counts().sum()}, there are no missing values\n\
Number of unique texts: {df['text'].nunique()}\n\n\
7613 - 7503 = 110 duplicates")

Number of texts:        7613, there are no missing values
Number of unique texts: 7503

7613 - 7503 = 110 duplicates


#### So `text` contains 110 redundant rows!
69 distinct texts are duplicated<br>
179 rows carry one of these texts (179 - 69 = 110)<br>

In [13]:
doublons = df[df['text'].duplicated(keep=False)]
print(f"Number of distinct texts among the duplicates: {doublons['text'].nunique()}")
print(f"Number of rows with a duplicated text: {doublons['text'].shape[0]}")


Number of distinct texts among the duplicates: 69
Number of rows with a duplicated text: 179
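
These figures are linked: 179 rows carry a duplicated text, those rows hold 69 distinct texts, so 179 - 69 = 110 rows are redundant copies. The sketch below reproduces that relationship on a made-up Series (`toy_text` is hypothetical), contrasting `duplicated(keep=False)` with the default `duplicated()`.

```python
import pandas as pd

# Hypothetical texts: "a" appears 3 times, "b" twice, "c" once.
toy_text = pd.Series(["a", "b", "a", "c", "b", "a"])

# keep=False flags every row whose value occurs more than once.
in_duplicate_group = toy_text.duplicated(keep=False)
rows_in_duplicate_groups = int(in_duplicate_group.sum())
distinct_duplicated_texts = toy_text[in_duplicate_group].nunique()

# The default keep='first' flags only the extra copies after the first occurrence.
redundant_rows = int(toy_text.duplicated().sum())

# Redundant copies = rows in duplicate groups minus one "original" per group.
assert redundant_rows == rows_in_duplicate_groups - distinct_duplicated_texts
assert redundant_rows == len(toy_text) - toy_text.nunique()
print(rows_in_duplicate_groups, distinct_duplicated_texts, redundant_rows)
```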


Examples of duplicates in `text`

In [14]:
# Select the rows whose text is duplicated
df[df['text'].duplicated(keep=False)]

# Show only the text of the duplicates
# df['text'][df['text'].duplicated(keep=False)]

| id | keyword | location | text | target |
|---|---|---|---|---|
| 59 | ablaze | Live On Webcam | Check these out: http://t.co/rOI2NSmEJJ http:/... | 0 |
| 68 | ablaze | Live On Webcam | Check these out: http://t.co/rOI2NSmEJJ http:/... | 0 |
| 156 | aftershock | US | 320 [IR] ICEMOON [AFTERSHOCK] \| http://t.co/vA... | 0 |
| 165 | aftershock | US | 320 [IR] ICEMOON [AFTERSHOCK] \| http://t.co/vA... | 0 |
| 171 | aftershock | Switzerland | 320 [IR] ICEMOON [AFTERSHOCK] \| http://t.co/TH... | 0 |
| ... | ... | ... | ... | ... |
| 10855 | NaN | NaN | Evacuation order lifted for town of Roosevelt:... | 1 |
| 10867 | NaN | NaN | #stormchase Violent Record Breaking EF-5 El Re... | 1 |
| 10870 | NaN | NaN | @aria_ahrary @TheTawniest The out of control w... | 1 |
| 10871 | NaN | NaN | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |


### Analysis of the text duplicates

Duplicated `text` values are not necessarily on adjacent rows<br>
For example, the following string appears at ids 4850 and 10855:

In [15]:
df.loc[10855,'text']

'Evacuation order lifted for town of Roosevelt: http://t.co/EDyfo6E2PU http://t.co/M5KxLPKFA1'

In [16]:
df[df['text'] == 'Evacuation order lifted for town of Roosevelt: http://t.co/EDyfo6E2PU http://t.co/M5KxLPKFA1']


| id | keyword | location | text | target |
|---|---|---|---|---|
| 4850 | evacuation | Tri-Cities, Wash. | Evacuation order lifted for town of Roosevelt:... | 1 |
| 10855 | NaN | NaN | Evacuation order lifted for town of Roosevelt:... | 1 |


Some texts appear more than 2 times, or even more than 6 times

In [None]:
# how many texts appear more than 2 times
df['text'].value_counts()[lambda x: x > 2].shape[0]

19

In [None]:
# how many texts appear more than 6 times
df['text'].value_counts()[lambda x: x > 6].shape[0]

1

In [65]:
# how many texts appear more than 5 times
df['text'].value_counts()[lambda x: x > 5].shape[0]

4
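
Rather than probing one threshold at a time, the whole distribution can be read at once by counting how many texts occur exactly k times. A sketch on made-up data (`toy_text` is hypothetical):

```python
import pandas as pd

# Hypothetical texts: "a" x3, "b" x2, "c" x1, "d" x2.
toy_text = pd.Series(["a", "b", "a", "c", "b", "a", "d", "d"])

# First value_counts(): occurrences per text.
occurrences_per_text = toy_text.value_counts()

# Second value_counts(): number of distinct texts for each occurrence count.
texts_per_occurrence_count = occurrences_per_text.value_counts().sort_index()

# Index = number of occurrences, value = how many distinct texts occur that often.
print(texts_per_occurrence_count.to_dict())
```

The thresholds checked above (19 texts with more than 2 occurrences, 4 with more than 5, 1 with more than 6) are then partial sums over this table.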

In [None]:
# Texts that appear more than 5 times
texts_over_5 = df['text'].value_counts()[lambda x: x > 5].index

# Filter the matching rows
filtered_df = df[df['text'].isin(texts_over_5)]

# Print the ids of the matching rows
print(filtered_df.index)


Index([4656, 4659, 4669, 4672, 4684, 4691, 5113, 5127, 5130, 5137, 5140, 5144,
       5145, 5153, 5157, 5159, 6087, 6090, 6097, 6111, 6118, 6132, 9095, 9098,
       9107, 9113, 9114, 9135],
      dtype='int64', name='id')


In [67]:
df.loc[4656,'text']

'He came to a land which was engulfed in tribal war and turned it into a land of peace i.e. Madinah. #ProphetMuhammad #islam'

In [None]:
df[df['text']=='He came to a land which was engulfed in tribal war and turned it into a land of peace i.e. Madinah. #ProphetMuhammad #islam']

| id | keyword | location | text | target |
|---|---|---|---|---|
| 4656 | engulfed | NaN | He came to a land which was engulfed in tribal... | 0 |
| 4659 | engulfed | Kuwait | He came to a land which was engulfed in tribal... | 1 |
| 4669 | engulfed | Bahrain | He came to a land which was engulfed in tribal... | 1 |
| 4672 | engulfed | NaN | He came to a land which was engulfed in tribal... | 0 |
| 4684 | engulfed | NaN | He came to a land which was engulfed in tribal... | 0 |
| 4691 | engulfed | NaN | He came to a land which was engulfed in tribal... | 0 |


## Note
Having a `location` appears to be positively correlated with the target<br>
Of the 6 observations sharing this `text`, only the 2 that have a `location` are labeled `1`<br>

Let's check with a `text` that occurs more than 6 times:

In [None]:
# 1. Identify the texts that appear more than 6 times
texts_over_6 = df['text'].value_counts()[lambda x: x > 6].index

# 2. Select the DataFrame rows that contain these texts
filtered_df = df[df['text'].isin(texts_over_6)]

# 3. Get their index labels (here, the tweet ids)
indices = filtered_df.index.tolist()

# 4. Print the index labels
print(indices)

[5113, 5127, 5130, 5137, 5140, 5144, 5145, 5153, 5157, 5159]


In [63]:
df.loc[5113,'text']

'11-Year-Old Boy Charged With Manslaughter of Toddler: Report: An 11-year-old boy has been charged with manslaughter over the fatal sh...'

In [69]:
df[df['text']=='11-Year-Old Boy Charged With Manslaughter of Toddler: Report: An 11-year-old boy has been charged with manslaughter over the fatal sh...']

| id | keyword | location | text | target |
|---|---|---|---|---|
| 5113 | fatal | NaN | 11-Year-Old Boy Charged With Manslaughter of T... | 1 |
| 5127 | fatal | Varanasi | 11-Year-Old Boy Charged With Manslaughter of T... | 1 |
| 5130 | fatal | Thane | 11-Year-Old Boy Charged With Manslaughter of T... | 1 |
| 5137 | fatal | NaN | 11-Year-Old Boy Charged With Manslaughter of T... | 1 |
| 5140 | fatal | NaN | 11-Year-Old Boy Charged With Manslaughter of T... | 1 |
| 5144 | fatal | NaN | 11-Year-Old Boy Charged With Manslaughter of T... | 1 |
| 5145 | fatal | Bangalore | 11-Year-Old Boy Charged With Manslaughter of T... | 1 |
| 5153 | fatal | Dimapur | 11-Year-Old Boy Charged With Manslaughter of T... | 1 |
| 5157 | fatal | NaN | 11-Year-Old Boy Charged With Manslaughter of T... | 1 |
| 5159 | fatal | NaN | 11-Year-Old Boy Charged With Manslaughter of T... | 1 |


The hypothesis does not hold here: all 10 rows are labeled `1`, with or without a `location`.<br>
Let's check with a `text` that occurs 6 times:

In [58]:
df[df['text']=='#Bestnaijamade: 16yr old PKK suicide bomber who detonated bomb in ... http://t.co/KSAwlYuX02 bestnaijamade bestnaijamade bestnaijamade be\x89Û_']

| id | keyword | location | text | target |
|---|---|---|---|---|
| 9095 | suicide%20bomb | Nigeria | #Bestnaijamade: 16yr old PKK suicide bomber wh... | 1 |
| 9098 | suicide%20bomb | Nigeria | #Bestnaijamade: 16yr old PKK suicide bomber wh... | 1 |
| 9107 | suicide%20bomb | Nigeria | #Bestnaijamade: 16yr old PKK suicide bomber wh... | 1 |
| 9113 | suicide%20bomb | Nigeria | #Bestnaijamade: 16yr old PKK suicide bomber wh... | 1 |
| 9114 | suicide%20bomb | Nigeria | #Bestnaijamade: 16yr old PKK suicide bomber wh... | 1 |
| 9135 | suicide%20bomb | Nigeria | #Bestnaijamade: 16yr old PKK suicide bomber wh... | 1 |
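
The three cases above are anecdotal. One way to test the hypothesis over all rows would be to compare the share of `target == 1` between rows with and without a `location`. The sketch below uses made-up rows; the `toy` frame is hypothetical and only its column names mirror the notebook's DataFrame.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the notebook's DataFrame.
toy = pd.DataFrame(
    {
        "location": ["Kuwait", None, None, "Thane", None, "Nigeria"],
        "target": [1, 0, 0, 1, 1, 1],
    }
)

# True where a location is present, False where it is missing.
has_location = toy["location"].notna()

# Mean of a 0/1 target within each group = share of rows labeled 1.
target_rate = toy.groupby(has_location)["target"].mean()
print(target_rate)
```

A clear gap between the two rates on the real data would support the hypothesis; similar rates would suggest that the pattern seen on ids 4656 to 4691 is a coincidence.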


### Full duplicates (all columns)

In [31]:
# Show only the observations that are full duplicates (identical across all columns)
full_duplicates = df[df.duplicated(keep=False)]
print(f"Number of duplicated observations (all columns): {full_duplicates.shape[0]}")
display(full_duplicates)

Number of duplicated observations (all columns): 87


| id | keyword | location | text | target |
|---|---|---|---|---|
| 59 | ablaze | Live On Webcam | Check these out: http://t.co/rOI2NSmEJJ http:/... | 0 |
| 68 | ablaze | Live On Webcam | Check these out: http://t.co/rOI2NSmEJJ http:/... | 0 |
| 156 | aftershock | US | 320 [IR] ICEMOON [AFTERSHOCK] \| http://t.co/vA... | 0 |
| 165 | aftershock | US | 320 [IR] ICEMOON [AFTERSHOCK] \| http://t.co/vA... | 0 |
| 171 | aftershock | Switzerland | 320 [IR] ICEMOON [AFTERSHOCK] \| http://t.co/TH... | 0 |
| ... | ... | ... | ... | ... |
| 9135 | suicide%20bomb | Nigeria | #Bestnaijamade: 16yr old PKK suicide bomber wh... | 1 |
| 9207 | suicide%20bombing | NaN | 'Suicide bombing at [location named]...' #prem... | 1 |
| 9225 | suicide%20bombing | NaN | 'Suicide bombing at [location named]...' #prem... | 1 |
| 9531 | terrorist | MAD as Hell | RT AbbsWinston: #Zionist #Terrorist kidnapped ... | 1 |
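
Only 87 rows are full duplicates, against 179 rows with a duplicated `text`: the remaining rows share a text but differ in `keyword`, `location` or `target`. The case where `target` differs matters most for training, since the same input then carries contradictory labels (as with ids 4656 and 4659 earlier). One possible way to list such texts, sketched on made-up rows (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical rows: "same tweet" carries both labels, "other tweet" is consistent.
toy = pd.DataFrame(
    {
        "text": ["same tweet", "same tweet", "other tweet", "other tweet", "unique tweet"],
        "target": [0, 1, 1, 1, 0],
    }
)

# Number of distinct labels attached to each text; more than 1 means a conflict.
labels_per_text = toy.groupby("text")["target"].nunique()
conflicting_texts = labels_per_text[labels_per_text > 1].index

# All rows whose text appears with more than one label.
conflicts = toy[toy["text"].isin(conflicting_texts)]
print(conflicts)
```

How to resolve such conflicts (dropping the rows, keeping the majority label, or relabeling by hand) is a modeling decision that this sketch leaves open.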


# Cleaning

In [32]:
# df['clean_text'] = (
#     df['text']
#     .str.lower()
#     .str.strip()
#     .str.replace(r'\s+', ' ', regex=True)
# )

# # Duplicates after cleaning
# df[df['clean_text'].duplicated(keep=False)]


In [33]:
# df.shape
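
The commented-out cell above lowercases, strips and collapses whitespace. Some rows shown earlier (ids 165 and 171, for instance) look identical up to their shortened `t.co` link in the truncated display, so a variant that also masks URLs might catch more near-duplicates. The sketch below is one possible version in plain Python; the `clean_text` function, the `<url>` placeholder and the URL pattern are assumptions, not part of the original notebook, and the two sample strings are made up.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Lowercase, mask URLs with a placeholder, and collapse whitespace."""
    text = text.lower().strip()
    text = URL_PATTERN.sub("<url>", text)  # assumption: links carry no signal for deduplication
    return WHITESPACE_PATTERN.sub(" ", text)


# Made-up examples modeled on the truncated tweets shown earlier; the links are fake.
a = "320 [IR] ICEMOON [AFTERSHOCK] |  http://t.co/aaa"
b = "320 [ir] icemoon [aftershock] | http://t.co/bbb "
print(clean_text(a))
print(clean_text(a) == clean_text(b))
```

On the notebook's DataFrame this could be applied with `df['clean_text'] = df['text'].map(clean_text)` before re-running the duplicate checks.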