## NLP Project: Automatic Summarization of Tennis Articles


In [1]:
# Step 1: Load the data
import pandas as pd
df = pd.read_csv('tennis_articles.csv', encoding='latin1')
df.drop(columns=['article_title'], inplace=True)  # Drop the title column
df.head()

Unnamed: 0,article_id,article_text,source
0,1,Maria Sharapova has basically no friends as te...,https://www.tennisworldusa.org/tennis/news/Mar...
1,2,"BASEL, Switzerland (AP)  Roger Federer advanc...",http://www.tennis.com/pro-game/2018/10/copil-s...
2,3,Roger Federer has revealed that organisers of ...,https://scroll.in/field/899938/tennis-roger-fe...
3,4,Kei Nishikori will try to end his long losing ...,http://www.tennis.com/pro-game/2018/10/nishiko...
4,5,"Federer, 37, first broke through on tour over ...",https://www.express.co.uk/sport/tennis/1036101...


In [None]:
# @title article_id

from matplotlib import pyplot as plt
df['article_id'].plot(kind='hist', bins=20, title='article_id')
plt.gca().spines[['top', 'right']].set_visible(False)

# Interpretation
We have a corpus of 25 tennis articles, which we will summarize automatically using NLP.

In [4]:
# Step 2: Sentence tokenization
import nltk
nltk.download('punkt')
nltk.download('punkt_tab') # Download the missing resource
from nltk.tokenize import sent_tokenize

sentences = []
for text in df['article_text']:
    sentences.extend(sent_tokenize(text))
print(f"Total number of sentences: {len(sentences)}")
sentences[:5]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Total number of sentences: 130


['Maria Sharapova has basically no friends as tennis players on the WTA Tour.',
 "The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much.",
 'I think everyone knows this is my job here.',
 "When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.",
 "So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match."]

# Business interpretation
The summary will be built from the most important sentences, so we first split every article into individual sentences here.
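
Because `sentences` is a flat list, the link between a sentence and its source article is not kept. If per-article summaries were needed later, one option would be to store the article id next to each sentence. The sketch below is only an illustration: it uses a naive regex splitter as a stand-in for `sent_tokenize` so that it runs without NLTK data, and the toy records and the name `split_sentences` are assumptions, not part of the notebook.

```python
import re

def split_sentences(text):
    """Naive splitter (hypothetical helper): break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# Toy (article_id, article_text) records standing in for the DataFrame rows.
toy_articles = [
    (1, "Federer won in Basel. He plays the final on Sunday."),
    (2, "Nishikori lost again! Will he qualify for London?"),
]

# Keep the article id alongside every extracted sentence.
tagged_sentences = [
    (article_id, sentence)
    for article_id, text in toy_articles
    for sentence in split_sentences(text)
]

for article_id, sentence in tagged_sentences:
    print(article_id, sentence)
```

With the ids preserved, the ranking step could be run per article instead of over the pooled corpus.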

# Business conclusion
This approach, based on semantic similarity and PageRank, summarizes documents automatically without any supervised learning.
It is particularly useful for monitoring, automated reporting, or the analysis of large volumes of text.

**Limitations**: Not suited to very short or highly specialized texts unless the embeddings or a pre-trained model are adapted to the domain.

In [5]:
# Step 3: Download the pre-trained GloVe vectors
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip -d glove.6B

--2025-07-20 08:21:46--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2025-07-20 08:21:46--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2025-07-20 08:21:46--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



In [6]:
# Step 4: Load the GloVe embeddings
import numpy as np
glove_path = 'glove.6B/glove.6B.100d.txt'
embeddings_index = {}
with open(glove_path, encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]  # first field is the token
        vector = np.asarray(values[1:], dtype='float32')  # remaining fields are the 100 components
        embeddings_index[word] = vector
print(f"{len(embeddings_index)} words loaded.")

400000 words loaded.
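
Each line of the GloVe text file holds a token followed by its vector components, separated by spaces. The following minimal sketch shows that parsing on an in-memory sample, so it runs without the 822 MB download. The helper name `parse_glove_lines` and the 3-dimensional toy vectors are illustrative assumptions, not part of the notebook.

```python
import io
import numpy as np

def parse_glove_lines(lines, dim):
    """Parse GloVe-style text lines into a {word: vector} dict (hypothetical helper)."""
    index = {}
    for line in lines:
        # Split from the right so that exactly `dim` numeric fields are taken.
        parts = line.rstrip().rsplit(' ', dim)
        if len(parts) != dim + 1:
            continue  # skip empty or malformed lines
        index[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return index

# Toy 3-dimensional sample standing in for glove.6B.100d.txt.
sample = io.StringIO("tennis 0.1 0.2 0.3\nserve -0.4 0.5 0.6\n\n")
toy_index = parse_glove_lines(sample, dim=3)
print(len(toy_index), toy_index['tennis'].shape)
```

Checking the number of fields per line is a cheap guard against a truncated download or a mismatched dimension.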


In [7]:
# Step 5: Clean the sentences
import re
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
clean_sentences = []
for s in sentences:
    s = s.lower()
    s = re.sub(r'[^a-z]', ' ', s)  # keep letters only (the text is already lowercased)
    s = ' '.join(w for w in s.split() if w not in stop_words)  # drop stop words
    clean_sentences.append(s)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
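
To make the effect of the cleaning step concrete, here is a small sketch that applies the same three operations (lowercasing, replacing non-letters with spaces, removing stop words) to one sentence. The tiny stop-word set and the function name `clean_sentence` are assumptions for illustration. The notebook itself uses NLTK's full English list.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's full English list.
TOY_STOP_WORDS = {"i", "m", "the", "to", "a", "and", "on", "or", "when"}

def clean_sentence(sentence, stop_words):
    """Lowercase, replace non-letters with spaces, and drop stop words."""
    lowered = sentence.lower()
    letters_only = re.sub(r'[^a-z]', ' ', lowered)
    return ' '.join(w for w in letters_only.split() if w not in stop_words)

cleaned = clean_sentence("When I'm on the court, I'm a competitor!", TOY_STOP_WORDS)
print(cleaned)  # court competitor
```

Note that the apostrophe in "I'm" becomes a space, so the contraction is split into "i" and "m" before the stop-word filter runs.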


In [8]:
# Step 6: Mean vector of each sentence
sentence_vectors = []
for sent in clean_sentences:
    words = sent.split()
    if len(words) != 0:
        vectors = [embeddings_index.get(w, np.zeros((100,))) for w in words]
        sentence_vectors.append(np.mean(vectors, axis=0))
    else:
        sentence_vectors.append(np.zeros((100,)))
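
The averaging above can be checked by hand on toy data. The sketch below uses 3-dimensional vectors in place of the 100-dimensional GloVe ones. The embedding values and the function name `sentence_vector` are assumptions for illustration. It also shows that a word missing from the index, which contributes a zero vector, only rescales the mean without changing its direction.

```python
import numpy as np

# Toy 3-dimensional embeddings standing in for the 100-d GloVe vectors.
toy_embeddings = {
    "federer": np.array([1.0, 0.0, 2.0]),
    "wins": np.array([3.0, 2.0, 0.0]),
}

def sentence_vector(sentence, embeddings, dim):
    """Mean of the word vectors; unknown words count as zero vectors."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    vectors = [embeddings.get(w, np.zeros(dim)) for w in words]
    return np.mean(vectors, axis=0)

known = sentence_vector("federer wins", toy_embeddings, dim=3)
with_oov = sentence_vector("federer wins zzz", toy_embeddings, dim=3)
print(known)     # mean of the two known vectors
print(with_oov)  # same direction, scaled by 2/3
```

Since cosine similarity ignores vector length, this rescaling should not affect the similarity scores computed in the next step, as long as the sentence contains at least one known word.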

In [9]:
# Step 7: Similarity matrix
from sklearn.metrics.pairwise import cosine_similarity
sim_mat = np.zeros([len(sentences), len(sentences)])

for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity([sentence_vectors[i]], [sentence_vectors[j]])[0, 0]
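
The double loop calls `cosine_similarity` once per pair of sentences. As a possible alternative, the whole matrix can be computed in one matrix product. The sketch below does this with NumPy only, on toy vectors. The function name `cosine_similarity_matrix` is an assumption, and treating all-zero vectors as having similarity 0 is a choice made here for the sketch.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities with a zeroed diagonal (hypothetical helper)."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows stay zero, so their similarity is 0
    unit = X / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # mirrors the `if i != j` guard in the loop
    return sim

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
toy_sim = cosine_similarity_matrix(toy_vectors)
print(np.round(toy_sim, 3))
```

scikit-learn's `cosine_similarity` also accepts the full list of vectors in a single call and returns the complete pairwise matrix, which avoids the Python-level loop in the same way.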

In [10]:
# Step 8: PageRank algorithm
import networkx as nx
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
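
To give an intuition for what the PageRank scores represent, here is a simplified power-iteration sketch on a small weighted graph. It is not networkx's implementation. The function name, the toy weights, and the stopping rule are assumptions, and it expects non-negative weights. The damping factor 0.85 matches the default `alpha` of `nx.pagerank`.

```python
import numpy as np

def pagerank_power_iteration(weights, damping=0.85, tol=1e-6, max_iter=100):
    """Simplified weighted PageRank on a dense, non-negative adjacency matrix."""
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    # Row-normalise into transition probabilities; nodes without edges jump uniformly.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    P = np.where(row_sums > 0, W / safe_sums, 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1.0 - damping) / n + damping * (rank @ P)
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank

# Sentence 0 is strongly similar to both others, so it should rank highest.
toy_weights = np.array([
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
])
toy_scores = pagerank_power_iteration(toy_weights)
print(toy_scores, toy_scores.sum())
```

A sentence scores highly when it is similar to many other sentences, which is why the top-ranked sentences are taken as the summary.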

In [11]:
# Step 9: Generate the summary
ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
print("Summary:\n")
for _, sentence in ranked_sentences[:10]:  # top 10 sentences by PageRank score
    print(sentence)

Summary:

I was on a nice trajectorythen, Reid recalled.If I hadnt got sick, I think I could have started pushing towards the second week at the slams and then who knows. Duringa comeback attempt some five years later, Reid added Bernard Tomic and 2018 US Open Federer slayer John Millman to his list of career scalps.
Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.
So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.
Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment.
Currently in ninth place, Nishikori with a win could move to within 125 points of the cut for the eight-man event in London next mo