<a href="https://colab.research.google.com/github/moisevictoire/Projet_de_Scrapping/blob/main/notebooks/colab-github-demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Google Colab with GitHub



In [33]:
import numpy as np
import pandas as pd
import nltk
import re
import seaborn as sns
import spacy
import gensim


In [36]:
from sklearn.datasets import fetch_20newsgroups
# Load a subset of the 20 newsgroups dataset
categories = ['sci.space', 'rec.autos']

data = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
df = pd.DataFrame({'text':data.data,'target':data.target})
df.head(10)

Unnamed: 0,text,target
0,Well thank you dennis for your as usual highly...,1
1,\n\nPerhaps a nice used '88 Pontiac Fiero GT? ...,0
2,"I bought a car with a defunct engine, to use f...",0
3,\nI haven't seen any speculation about it. But...,1
4,I am in the process of looking for a half dece...,0
5,\n\nRumor has it that a guy at Dell Computer h...,0
6,"\nActually, the reboost will probably be done ...",1
7,\n\n\n\nFirst I've heard of it. Offhand:\n\nGr...,1
8,------- Blind-Carbon-Copy\n\nTo: spacenews@aus...,1
9,\n\nSherzer Methodology!!!!!!\n\n,1


In [54]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
  # Lowercase the text
  texte=text.lower()
  # Remove punctuation
  texte=re.sub(r'[^\w\s]','',texte)
  # Remove digits
  texte=re.sub(r'\d+','',texte)
  # Tokenize
  tokens=nltk.word_tokenize(texte)
  # Remove stopwords
  tokens=[w for w in tokens if w not in stop_words]
  # Lemmatize
  tokens=[lemmatizer.lemmatize(w) for w in tokens]
  return ' '.join(tokens)

df['clean_text'] = df['text'].apply(preprocess_text)
df.head(10)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


Unnamed: 0,text,target,clean_text
0,Well thank you dennis for your as usual highly...,1,well thank dennis usual highly detailed inform...
1,\n\nPerhaps a nice used '88 Pontiac Fiero GT? ...,0,perhaps nice used pontiac fiero gt liter anyon...
2,"I bought a car with a defunct engine, to use f...",0,bought car defunct engine use part old still r...
3,\nI haven't seen any speculation about it. But...,1,havent seen speculation salyut kb design burea...
4,I am in the process of looking for a half dece...,0,process looking half decent aftermarket sport ...
5,\n\nRumor has it that a guy at Dell Computer h...,0,rumor guy dell computer miata totalled would k
6,"\nActually, the reboost will probably be done ...",1,actually reboost probably done last fuel reser...
7,\n\n\n\nFirst I've heard of it. Offhand:\n\nGr...,1,first ive heard offhand griffin longer office ...
8,------- Blind-Carbon-Copy\n\nTo: spacenews@aus...,1,blindcarboncopy spacenewsaustenrandorg ctiaust...
9,\n\nSherzer Methodology!!!!!!\n\n,1,sherzer methodology


In [59]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(df['clean_text'])

# Display the shape of the matrix
print("BOW matrix shape:", X_bow.shape)
X_bow.toarray()

BOW matrix shape: (1187, 15274)


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [61]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df['clean_text'])
print("TF-IDF matrix shape:", X_tfidf.shape)

TF-IDF matrix shape: (1187, 15274)


In [63]:
bigram_vectorizer = CountVectorizer(ngram_range=(1,2))
X_bigrams = bigram_vectorizer.fit_transform(df['clean_text'])
print("Matrix shape with bigrams:", X_bigrams.shape)

Matrix shape with bigrams: (1187, 99336)


In [66]:
from gensim.models import Word2Vec

# Prepare the sentences (one list of tokens per document)
sentences = [text.split() for text in df['clean_text']]

# Train the Word2Vec model
model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1, workers=4)

# Access the vector of a word:
print(model.wv['space'])  # for example, if the word 'space' is in the vocabulary


[-0.19240853  1.5345283   0.96560377 -0.00409734  0.42681038 -2.2085502
 -0.19866167  2.388406   -0.7301665  -0.70181954 -0.7304914  -1.9409374
 -0.24756195  0.08514768  0.03423491 -1.0731382   0.45679307 -1.5186659
  0.70318896 -2.0813382   0.6744037   0.6836929   0.61843204 -0.7613056
 -0.03723383  0.1683944  -1.296787   -0.71203053 -1.5366127  -0.11038778
  0.99013543  0.2197839  -0.01247782 -0.9991489  -0.12133306  1.4796522
  0.20435798 -0.7361682  -0.5916446  -2.1484818   0.29636446 -1.222076
 -0.27381566 -0.27598822  0.68510896 -0.72981834 -1.066588   -0.16067861
  0.47435054  1.0491514   0.12796137 -1.2129654  -0.5428284  -0.5984891
 -0.99063194  0.5757258   0.45754728  0.14235498 -0.74378175  0.28809288
  0.6250359   0.46591952 -0.2917957   0.35993928 -1.106629    0.8050609
  0.11777952  0.29719317 -0.9973455   1.091483   -0.816019    0.37059367
  1.331131   -1.2595742   1.1741241   0.6343277   0.23526302 -0.23913091
 -0.51580054  0.2686628  -0.8004446  -0.48349643 -0.70277506


[Google Colaboratory](http://colab.research.google.com) is designed to integrate cleanly with GitHub: you can load notebooks from GitHub and save notebooks back to GitHub.

## Loading Public Notebooks Directly from GitHub

Colab can load public GitHub notebooks directly, with no authorization step required.

For example, consider the notebook at this address: https://github.com/googlecolab/colabtools/blob/main/notebooks/colab-github-demo.ipynb.

The direct Colab link to this notebook is: https://colab.research.google.com/github/googlecolab/colabtools/blob/main/notebooks/colab-github-demo.ipynb.

To generate such links in one click, you can use the [Open in Colab](https://chrome.google.com/webstore/detail/open-in-colab/iogfkhleblhcpcekbiedikdehleodpjo) Chrome extension.
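
As a rough illustration of the URL mapping described in this section, the sketch below derives the Colab link from a GitHub notebook address by keeping the path and swapping the host for the `colab.research.google.com/github` prefix. The helper name `github_to_colab_url` is hypothetical and is not part of any Colab API; it only assumes the URL pattern shown above.

```python
from urllib.parse import urlparse

# Prefix taken from the example Colab link above.
COLAB_GITHUB_PREFIX = "https://colab.research.google.com/github"


def github_to_colab_url(github_url: str) -> str:
    """Hypothetical helper: map a github.com notebook URL to its Colab URL."""
    parsed = urlparse(github_url)
    if parsed.netloc not in ("github.com", "www.github.com"):
        raise ValueError(f"not a github.com URL: {github_url}")
    # The path (/user/repo/blob/branch/path/to/notebook.ipynb) is reused unchanged.
    return COLAB_GITHUB_PREFIX + parsed.path


print(github_to_colab_url(
    "https://github.com/googlecolab/colabtools/blob/main/notebooks/colab-github-demo.ipynb"
))
```

For the example notebook, this prints the same direct link given above.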

## Browsing GitHub Repositories from Colab

Colab also supports special URLs that link directly to a GitHub browser for any user/organization, repository, or branch. For example:

- http://colab.research.google.com/github will give you a general GitHub browser, where you can search for any GitHub organization or username.
- http://colab.research.google.com/github/googlecolab/ will open the repository browser for the ``googlecolab`` organization. Replace ``googlecolab`` with any other GitHub org or user to see their repositories.
- http://colab.research.google.com/github/googlecolab/colabtools/ will let you browse the main branch of the ``colabtools`` repository within the ``googlecolab`` organization. Substitute any user/org and repository to see its contents.
- http://colab.research.google.com/github/googlecolab/colabtools/blob/main will let you browse the ``main`` branch of the ``colabtools`` repository within the ``googlecolab`` organization. (Don't forget the ``blob`` here!) You can specify any valid branch for any valid repository.
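
The four URL forms listed above follow one pattern, which can be sketched as a small Python function. The name `colab_browse_url` and its parameters are hypothetical, chosen only for illustration; the function simply reproduces the URL shapes from the list and does not check that the user, repository, or branch exists.

```python
# Base address of the Colab GitHub browser, as given in the list above.
COLAB_GITHUB_BROWSER = "http://colab.research.google.com/github"


def colab_browse_url(owner=None, repo=None, branch=None):
    """Hypothetical helper: build a Colab GitHub-browser URL."""
    if owner is None:
        if repo is not None or branch is not None:
            raise ValueError("repo and branch require an owner")
        return COLAB_GITHUB_BROWSER  # general GitHub browser
    if repo is None:
        if branch is not None:
            raise ValueError("branch requires a repo")
        return f"{COLAB_GITHUB_BROWSER}/{owner}/"  # repositories of a user/org
    if branch is None:
        return f"{COLAB_GITHUB_BROWSER}/{owner}/{repo}/"  # main branch of a repo
    # A specific branch needs the "blob" path segment.
    return f"{COLAB_GITHUB_BROWSER}/{owner}/{repo}/blob/{branch}"


print(colab_browse_url("googlecolab", "colabtools", "main"))
```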

## Loading Private Notebooks

Loading a notebook from a private GitHub repository is possible, but requires an additional step to allow Colab to access your files.
Do the following:

1. Navigate to http://colab.research.google.com/github.
2. Click the "Include Private Repos" checkbox.
3. In the popup window, sign in to your GitHub account and authorize Colab to read the private files.
4. Your private repositories and notebooks will now be available via the GitHub navigation pane.

## Saving Notebooks To GitHub or Drive

Any time you open a GitHub hosted notebook in Colab, it opens a new editable view of the notebook. You can run and modify the notebook without worrying about overwriting the source.

If you would like to save your changes from within Colab, you can use the File menu to save the modified notebook either to Google Drive or back to GitHub. Choose **File→Save a copy in Drive** or **File→Save a copy to GitHub** and follow the resulting prompts. Saving a Colab notebook to GitHub requires giving Colab permission to push the commit to your repository.

## Open In Colab Badge

Anybody can open a copy of any GitHub-hosted notebook within Colab. To make it easier to give people access to live views of GitHub-hosted notebooks,
Colab provides a [shields.io](http://shields.io/)-style badge, which appears as follows:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/googlecolab/colabtools/blob/main/notebooks/colab-github-demo.ipynb)

The markdown for the above badge is the following:

```markdown
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/googlecolab/colabtools/blob/main/notebooks/colab-github-demo.ipynb)
```

The HTML equivalent is:

```HTML
<a href="https://colab.research.google.com/github/googlecolab/colabtools/blob/main/notebooks/colab-github-demo.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
```

Remember to replace the notebook URL in this template with the notebook you want to link to.
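
If you maintain several notebooks, filling in the template by hand gets repetitive. The following is a minimal sketch that generates both badge snippets from a Colab notebook URL. The function names `colab_badge_markdown` and `colab_badge_html` are hypothetical helpers, not something Colab provides; the badge image address is the one used in the templates above.

```python
from html import escape

# Badge image used in the templates above.
BADGE_SVG = "https://colab.research.google.com/assets/colab-badge.svg"


def colab_badge_markdown(colab_url: str) -> str:
    """Hypothetical helper: Markdown badge linking to the given Colab URL."""
    return f"[![Open In Colab]({BADGE_SVG})]({colab_url})"


def colab_badge_html(colab_url: str) -> str:
    """Hypothetical helper: HTML badge linking to the given Colab URL."""
    href = escape(colab_url, quote=True)  # keep the attribute value well-formed
    return (
        f'<a href="{href}">\n'
        f'  <img src="{BADGE_SVG}" alt="Open In Colab"/>\n'
        f"</a>"
    )


url = "https://colab.research.google.com/github/googlecolab/colabtools/blob/main/notebooks/colab-github-demo.ipynb"
print(colab_badge_markdown(url))
print(colab_badge_html(url))
```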