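![Cover](https://www.oreilly.co.jp/books/images/picture978-4-87311-907-6.gif)

This notebook contains sample code for the book [『セキュリティエンジニアのための機械学習』](https://www.oreilly.co.jp/books/9784873119076/) (literally, "Machine Learning for Security Engineers"), published by O'Reilly Japan. For explanations of the code, please refer to the book. The authors and O'Reilly Japan accept no responsibility for the results of running this code.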

In [1]:
import tweepy
import json
import numpy as np
import pandas as pd

# Keys required to use the API
consumer_key="YOUR KEY"
consumer_secret="YOUR KEY"
access_token="YOUR KEY"
access_token_secret="YOUR KEY"

# Set up authentication for the Twitter API via Tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

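# Enable wait_on_rate_limit so that the client waits for the required time
# whenever the API rate limit is reached.
# Note: this is the Tweepy 3.x signature; wait_on_rate_limit_notify was
# removed in Tweepy 4.0.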
api = tweepy.API(
    auth, 
    parser=tweepy.parsers.JSONParser(), 
    wait_on_rate_limit=True, 
    wait_on_rate_limit_notify=True
    )
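
The keys above are placeholders to be filled in by hand. As a hedged alternative, the sketch below reads them from environment variables so that they do not end up saved in the notebook. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are hypothetical and not part of the book's code; the demonstration uses a fake environment instead of real secrets.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(env=os.environ):
    """Return the four credentials from `env`, failing loudly if any is missing."""
    missing = [name for name in ENV_NAMES.values() if name not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_NAMES.items()}


# Demonstration with a fake environment (no real secrets involved).
fake_env = {name: "dummy-" + name.lower() for name in ENV_NAMES.values()}
credentials = load_credentials(fake_env)
print(sorted(credentials))
```

The resulting dictionary could then supply `consumer_key`, `consumer_secret`, `access_token`, and `access_token_secret` in the cell above.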

In [2]:
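# Search for "exploit http" and exclude retweets
search_term = "exploit http -filter:retweets"

# Variable that stores the id of the oldest tweet seen so far
oldest_tweet = None

# List that holds the collected tweets
TempDict = []

# Counter for the number of collected tweets
counter = 0

# Loop 10 times, covering up to 1,000 tweets in total
for x in range(10):

    # Fetch up to 100 of the most recent tweets, with their full text
    # (api.search is the Tweepy 3.x name; Tweepy 4.0 renamed it api.search_tweets)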
    public_tweets = api.search(search_term, 
                               count=100, 
                               result_type="recent",
                               tweet_mode="extended", 
                               max_id=oldest_tweet)

    # Collect the tweets that match the conditions
    for tweet in public_tweets["statuses"]:
        # Exclude quote tweets as well
        if 'quoted_status' not in tweet:
            TempDict.append(tweet)

            # Increment the counter by 1
            counter += 1

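        # Track the id of the oldest tweet in the results. max_id is inclusive,
        # so subtract 1 to make the next search return only strictly older
        # tweets and avoid collecting the boundary tweet twice
        oldest_tweet = tweet["id"] - 1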

print("Collected {} tweets".format(counter))

Collected 725 tweets
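
The `max_id` paging used in the loop above can be exercised offline. The following is a minimal sketch, assuming a stand-in function `fake_search` (hypothetical, not part of Tweepy) that mimics the inclusive `max_id` behaviour of the search endpoint over a made-up timeline of 250 tweets. It shows why stepping to one below the oldest id seen avoids collecting the same tweet twice, and why stopping on an empty page avoids wasted requests.

```python
# Made-up timeline, newest first (ids 250 down to 1).
FAKE_TIMELINE = [{"id": i, "full_text": "tweet {}".format(i)} for i in range(250, 0, -1)]


def fake_search(count=100, max_id=None):
    """Hypothetical stand-in for the search call: ids <= max_id, newest first."""
    matches = [t for t in FAKE_TIMELINE if max_id is None or t["id"] <= max_id]
    return {"statuses": matches[:count]}


def collect(pages, count=100):
    """Page backwards through the timeline, at most `pages` requests."""
    collected = []
    max_id = None
    for _ in range(pages):
        statuses = fake_search(count=count, max_id=max_id)["statuses"]
        if not statuses:
            break  # no older tweets remain
        collected.extend(statuses)
        # max_id is inclusive, so continue from one below the oldest id seen
        max_id = min(t["id"] for t in statuses) - 1
    return collected


tweets = collect(pages=10)
ids = [t["id"] for t in tweets]
print(len(ids), len(set(ids)))
```

Both counts are equal here because no tweet is fetched twice.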


In [3]:
data = pd.DataFrame(
    data=[tweet['full_text'] for tweet in TempDict], 
    columns=['TweetText']
    )

In [4]:
import re
URLPATTERN = r'(https?://\S+)'

data['URL'] = data["TweetText"].str.extract(URLPATTERN, expand=False).str.strip()
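
Note that `str.extract` keeps only the first match of the pattern in each tweet. The sketch below, on two made-up sample strings, illustrates that behaviour and contrasts it with `str.findall`, which returns every URL. The `sample` frame and the `AllURLs` column are assumptions for illustration only.

```python
import pandas as pd

URLPATTERN = r'(https?://\S+)'

# Made-up sample texts: one with two URLs, one with none.
sample = pd.DataFrame({"TweetText": [
    "PoC released https://example.com/poc more at https://example.com/blog",
    "no link in this one",
]})

# First URL only; rows without a match become NaN.
sample["URL"] = sample["TweetText"].str.extract(URLPATTERN, expand=False)

# Every URL in the text, as a list (empty when there is no match).
sample["AllURLs"] = sample["TweetText"].str.findall(URLPATTERN)

print(sample[["URL", "AllURLs"]])
```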

In [5]:
data.head(20)

Unnamed: 0,TweetText,URL
0,"#Body #Health You will have greater vigour, a ...",https://t.co/Vvt3wX9Bwj
1,Let's not exploit our earth to the moment wher...,https://t.co/Nf86xXjutB
2,Zeno - I have just completed this room! Check ...,https://t.co/lufKvkJh4J
3,Critical Atlassian 0-day is under active explo...,https://t.co/999okEiH0E
4,#Health You can have increased enjoyment; a mu...,https://t.co/msRznfYg8c
5,Fraudsters using cruel new scam to exploit tru...,https://t.co/s8HRTRnmAF
6,Here we love Horses and Wild Horses. Do not ge...,https://t.co/SogsgqscZ3
7,#Java #Linux Graduation Penetration Testing Pr...,https://t.co/fL4SjMbPg1
8,Stade Brestois. Le splendide but de Belaïli av...,https://t.co/wI4Z0lb4gc
9,Cruel new scam that tricks close family into h...,https://t.co/A5cq83fPGX


In [6]:
!pip install pigeonXT-jupyter==0.4.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pigeonXT-jupyter==0.4.1
  Downloading pigeonXT_jupyter-0.4.1-py3-none-any.whl (10 kB)
Installing collected packages: pigeonXT-jupyter
Successfully installed pigeonXT-jupyter-0.4.1


In [7]:
# Set the pandas column width to 300 so that the tweet text can be read in full
pd.set_option("display.max_colwidth", 300)

In [8]:
from IPython.display import display
from pigeonXT import annotate

# Display the tweet text and URL one row at a time
annotations = annotate(
    data.index, 
    options=['Exploit', 'NOT Exploit'], 
    display_fn=lambda idx : display(
        data.loc[idx,['TweetText']],
        data.loc[idx,['URL']]
        )
    )

HTML(value='0 of 725 Examples annotated, Current Position: 0 ')

HBox(children=(Button(description='Exploit', style=ButtonStyle()), Button(description='NOT Exploit', style=But…

Output()

In [9]:
# The labeled data comes back as key/value pairs (JSON-like), so convert it to a DataFrame
dataset = pd.DataFrame(annotations.items(), columns=['index', 'label'])

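# Copy the tweet text and URL columns from the scraped DataFrame, looking
# them up by the original row index so skipped rows do not shift the labels
dataset['TweetText'] = dataset['index'].map(data['TweetText'])
dataset['URL'] = dataset['index'].map(data['URL'])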
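
The sketch below illustrates, on made-up data, how labels can be joined back to the tweet text by key instead of by row position. It assumes `annotations` behaves like a dict keyed by the original row index, as the `.items()` call suggests; the three-row `data` frame and the skipped row are hypothetical.

```python
import pandas as pd

# Made-up scraped data with three rows.
data = pd.DataFrame({
    "TweetText": ["text 0", "text 1", "text 2"],
    "URL": ["https://example.com/0", "https://example.com/1", "https://example.com/2"],
})

# Hypothetical annotation result in which row 1 was skipped.
annotations = {0: "NOT Exploit", 2: "Exploit"}

dataset = pd.DataFrame(list(annotations.items()), columns=["index", "label"])

# Look up text and URL through the stored index, not the row position.
dataset["TweetText"] = dataset["index"].map(data["TweetText"])
dataset["URL"] = dataset["index"].map(data["URL"])

print(dataset)
```

With positional assignment, the second label would be paired with "text 1"; the lookup by index pairs it with "text 2" as intended.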

In [10]:
dataset

Unnamed: 0,index,label,TweetText,URL
0,0,NOT Exploit,"#Body #Health You will have greater vigour, a more rewarding physique and also lose fat in the event you exploit the foregoing idea https://t.co/Vvt3wX9Bwj https://t.co/yIUQi1Ihmx",https://t.co/Vvt3wX9Bwj
1,1,NOT Exploit,Let's not exploit our earth to the moment where the only greenery that we can see is our computer or phone screen.\nWorld Environment Day\n\n@sportonixsports \n\nCall : +91 63643 07888 | 80416 81050\n\nhttps://t.co/Nf86xXjutB\n\n#environmentalist #ngo #sustainable #environmentday2022 https://t.c...,https://t.co/Nf86xXjutB
2,2,NOT Exploit,Zeno - I have just completed this room! Check it out: https://t.co/lufKvkJh4J #tryhackme #security #rce #remote code execution #oscp #exploit #http #sql #sudo #zeno via @realtryhackme,https://t.co/lufKvkJh4J
3,3,NOT Exploit,"Critical Atlassian 0-day is under active exploit. You’re patched, right? https://t.co/999okEiH0E https://t.co/ribPS4J46E",https://t.co/999okEiH0E
4,4,NOT Exploit,"#Health You can have increased enjoyment; a much better physique as well as , besides that reduce additional fat for those who exploit this kind of tactic https://t.co/msRznfYg8c https://t.co/TsUYuwEUWX",https://t.co/msRznfYg8c
5,5,NOT Exploit,Fraudsters using cruel new scam to exploit trust between family members and close friends - and rob them of tens of thousands of pounds https://t.co/s8HRTRnmAF,https://t.co/s8HRTRnmAF
6,6,NOT Exploit,"Here we love Horses and Wild Horses. Do not get our Love confused with those who announce their love, then exploit them. It is extinction, playing both sides of the fence. This story reflects the catastrophe in the future, long after the horses extinct.\n\nhttps://t.co/SogsgqscZ3",https://t.co/SogsgqscZ3
7,7,NOT Exploit,"#Java #Linux Graduation Penetration Testing Project: Hello , Every one i need to write a scrpit in python to get exploits for exploit-db and exploit i need to write in the script 5 exploits releated to wordpress i need the script to… https://t.co/fL4SjMbPg1 Click Link to Apply",https://t.co/fL4SjMbPg1
8,8,NOT Exploit,Stade Brestois. Le splendide but de Belaïli avec l'Algérie contre l'Ouganda [Vidéo] - Le Télégramme: Le joueur du Stade Brestois Youcef Belaïli a inscrit un très beau but suite à un exploit individuel lors de la victoire de l'Algérie… https://t.co/wI4Z0lb4gc Algeria_Information,https://t.co/wI4Z0lb4gc
9,9,NOT Exploit,"Cruel new scam that tricks close family into helping crooks raid your accounts » Scammer News: Fraudsters are using a cruel new scam to exploit trust between family members and close friends – and rob them of tens of thousands of… https://t.co/A5cq83fPGX #ScamFraud Seoul, Korea https://t.co/mW86...",https://t.co/A5cq83fPGX


In [11]:
dataset.to_csv('dataset.csv', index=False)
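
As a quick sanity check after saving, the file can be read back and the label distribution inspected. This is a minimal sketch on made-up rows written to a temporary directory; the sample contents are assumptions, and only the column names follow the notebook.

```python
import os
import tempfile

import pandas as pd

# Made-up labeled rows using the same columns as the notebook's dataset.
dataset = pd.DataFrame({
    "index": [0, 1, 2],
    "label": ["NOT Exploit", "Exploit", "NOT Exploit"],
    "TweetText": ["first line\nsecond line", "text 1", "text 2"],
    "URL": ["https://example.com/0", "https://example.com/1", "https://example.com/2"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.csv")
    # index=False keeps the DataFrame's own row index out of the file
    dataset.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Number of examples per label, useful for spotting class imbalance.
label_counts = reloaded["label"].value_counts()
print(label_counts)
```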