# Scraping information from a Twitch live streaming channel

At the time of writing, the Twitch API offered many possibilities, including interacting with chat users through a chatbot, but sadly did not offer a way to retrieve the chat log of a given channel.  
In this notebook, we show one way to do so in Python using the standard library module **socket**, which opens a network connection to the Twitch IRC chat server and reads the chat as it arrives.

We are only interested in retrieving the URLs posted in the chat, as these could lead users to websites that do not comply with UK gambling laws. The information is intended for the Enforcement Team. User data (nicknames on Twitch) should not be retained, as it is irrelevant to the Enforcement Team and keeping it could breach data privacy regulations.

NB: this notebook will *not* run if launched from Google Colab. Please run it from VS Code instead, or create a corresponding `.py` file.

In [1]:
import socket
from emoji import demojize
import logging
import time
import pandas as pd
import numpy as np

In [2]:
# Define the parameters used to connect to the Twitch chat server with socket.
# The OAuth token (prefixed with "oauth:") is obtained from the "developers" section of the Twitch website.

server   = 'irc.chat.twitch.tv'
port     = 6667
nickname = 'eaquearcana'
token    = "oauth:jmzvwalm7hajyuzp423s0cr18ssbqt"
channel  = '#casinodaddy'
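
The cell above stores the OAuth token directly in the notebook. As a hedged alternative, the sketch below reads the credentials from environment variables so that the token is not saved with the notebook. The helper name `load_credentials` and the variable names `TWITCH_NICKNAME` and `TWITCH_TOKEN` are assumptions made for illustration; they are not Twitch conventions.

```python
import os

def load_credentials(env=None):
    """Return (nickname, token) read from an environment-style mapping.

    TWITCH_NICKNAME and TWITCH_TOKEN are hypothetical variable names.
    """
    env = os.environ if env is None else env
    nickname = env.get("TWITCH_NICKNAME", "")
    token = env.get("TWITCH_TOKEN", "")
    if not nickname or not token:
        raise RuntimeError("Set TWITCH_NICKNAME and TWITCH_TOKEN before connecting")
    # The PASS value used in this notebook starts with "oauth:"
    if not token.startswith("oauth:"):
        token = "oauth:" + token
    return nickname, token
```

With the two variables set in the shell that launches the notebook, `nickname, token = load_credentials()` could replace the two hard-coded assignments.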

In [4]:
# In general, a socket is instantiated and connected, then the required parameters are sent to the server.
# The socket then receives bytes, decoded here as UTF-8, which can be stored in an external file.

In [3]:
sock = socket.socket()
sock.connect((server, port))

sock.send(f"PASS {token}\n".encode('utf-8'))
sock.send(f"NICK {nickname}\n".encode('utf-8'))
sock.send(f"JOIN {channel}\n".encode('utf-8'))


18

In [5]:
# The two lines below should be run twice to start seeing some chat and verify that things work well.
# The 1st run should return a few generic welcome messages from Twitch indicating that the connection worked. Wait 5 seconds, then run the cell again.
# The 2nd run should return a section of chat, which the user can verify by opening the channel chat.

resp = sock.recv(2048).decode('utf-8')

print(resp)



:inyanghost!inyanghost@inyanghost.tmi.twitch.tv PRIVMSG #casinodaddy :Clap
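
The output above shows the shape of a raw chat line: a prefix carrying the nickname, the keyword `PRIVMSG`, the channel, and the message after a colon. As a minimal sketch (the names `PRIVMSG_RE` and `parse_privmsg` are hypothetical helpers, not part of any library), such a line could be split into its parts with a regular expression:

```python
import re

# Matches ":nick!nick@nick.tmi.twitch.tv PRIVMSG #channel :message text"
PRIVMSG_RE = re.compile(
    r"^:(?P<nick>[^!\s]+)!\S+ PRIVMSG (?P<channel>#\S+) :(?P<message>.*)$"
)

def parse_privmsg(line):
    """Return (nick, channel, message) for a chat line, or None for other IRC lines."""
    match = PRIVMSG_RE.match(line.strip())
    if match is None:
        return None
    return match.group("nick"), match.group("channel"), match.group("message")

print(parse_privmsg(
    ":inyanghost!inyanghost@inyanghost.tmi.twitch.tv PRIVMSG #casinodaddy :Clap"
))
# -> ('inyanghost', '#casinodaddy', 'Clap')
```

Server lines such as the numeric welcome messages do not contain `PRIVMSG` and return `None`, so they can be skipped without special cases.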



In [6]:

# The socket should be closed when done!
sock.close()


If things worked well so far, great! Now it's time to "record" the chat in an external file in order to work on it later.

In [7]:
sock = socket.socket()
sock.connect((server, port))

sock.send(f"PASS {token}\n".encode('utf-8'))
sock.send(f"NICK {nickname}\n".encode('utf-8'))
sock.send(f"JOIN {channel}\n".encode('utf-8'))


# receive twice to start seeing some chat and verify that the connection works
resp = sock.recv(2048).decode('utf-8')
print(resp)
time.sleep(10)
# second receive:
resp = sock.recv(2048).decode('utf-8')
print(resp)

# prepare the file to store the chat in and set the log format, which includes the date:
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s — %(message)s',
                    datefmt='%Y-%m-%d_%H:%M:%S',
                    handlers=[logging.FileHandler('chat_test.log', encoding='utf-8')])

logging.info(resp)

sock.close()

# check whether the file "chat_test.log" contains some chat data

:tmi.twitch.tv 001 eaquearcana :Welcome, GLHF!
:tmi.twitch.tv 002 eaquearcana :Your host is tmi.twitch.tv
:tmi.twitch.tv 003 eaquearcana :This server is rather new
:tmi.twitch.tv 004 eaquearcana :-
:tmi.twitch.tv 375 eaquearcana :-
:tmi.twitch.tv 372 eaquearcana :You are in a maze of twisty passages, all alike.
:tmi.twitch.tv 376 eaquearcana :>

:eaquearcana!eaquearcana@eaquearcana.tmi.twitch.tv JOIN #casinodaddy
:eaquearcana.tmi.twitch.tv 353 eaquearcana = #casinodaddy :eaquearcana
:eaquearcana.tmi.twitch.tv 366 eaquearcana #casinodaddy :End of /NAMES list
:cncdaddy!cncdaddy@cncdaddy.tmi.twitch.tv PRIVMSG #casinodaddy :700x?



In [15]:
## If things worked smoothly, it's now time to record a longer stretch of chat to work on later:

sock = socket.socket()
sock.connect((server, port))

sock.send(f"PASS {token}\n".encode('utf-8'))
sock.send(f"NICK {nickname}\n".encode('utf-8'))
sock.send(f"JOIN {channel}\n".encode('utf-8'))



resp = sock.recv(2048).decode('utf-8')
print(resp)
time.sleep(3)
# and 2...:
resp = sock.recv(2048).decode('utf-8')
print(resp)

# logging.basicConfig(level=logging.DEBUG,
#                     format='%(asctime)s — %(message)s',
#                     datefmt='%Y-%m-%d_%H:%M:%S',
#                     handlers=[logging.FileHandler('chat_test.log', encoding='utf-8')])


logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s — %(message)s',
                    datefmt='%Y-%m-%d_%H:%M:%S',
                    handlers=[logging.FileHandler('chat.log', encoding='utf-8')])

logging.info(resp)

# Choose how long to record the live chat, in minutes.
# The server periodically sends a "PING" message; the code should watch for it and send a "PONG" back to keep the connection alive

duration_minutes = 2
duration_seconds = duration_minutes * 60
start_time       = time.time()

while time.time() - start_time < duration_seconds:
    resp = sock.recv(2048).decode('utf-8')

    if resp.startswith('PING'):
        sock.send("PONG\n".encode('utf-8'))

    elif len(resp) > 0:
        logging.info(demojize(resp))


sock.close()


:tmi.twitch.tv 001 eaquearcana :Welcome, GLHF!
:tmi.twitch.tv 002 eaquearcana :Your host is tmi.twitch.tv
:tmi.twitch.tv 003 eaquearcana :This server is rather new
:tmi.twitch.tv 004 eaquearcana :-
:tmi.twitch.tv 375 eaquearcana :-
:tmi.twitch.tv 372 eaquearcana :You are in a maze of twisty passages, all alike.
:tmi.twitch.tv 376 eaquearcana :>

:eaquearcana!eaquearcana@eaquearcana.tmi.twitch.tv JOIN #casinodaddy
:eaquearcana.tmi.twitch.tv 353 eaquearcana = #casinodaddy :eaquearcana
:eaquearcana.tmi.twitch.tv 366 eaquearcana #casinodaddy :End of /NAMES list



--- Logging error ---
Traceback (most recent call last):
  File "c:\Users\mbgnwmab\AppData\Local\anaconda3\envs\s2ds_ML\lib\logging\__init__.py", line 1036, in emit
    stream.write(msg)
ValueError: I/O operation on closed file.
Call stack:
  File "c:\Users\mbgnwmab\AppData\Local\anaconda3\envs\s2ds_ML\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\Users\mbgnwmab\AppData\Local\anaconda3\envs\s2ds_ML\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\Users\mbgnwmab\AppData\Local\anaconda3\envs\s2ds_ML\lib\site-packages\ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "c:\Users\mbgnwmab\AppData\Local\anaconda3\envs\s2ds_ML\lib\site-packages\traitlets\config\application.py", line 1043, in launch_instance
    app.start()
  File "c:\Users\mbgnwmab\AppData\Local\anaconda3\envs\s2ds_ML\lib\site-packages\ipykernel\kernelapp.py", line 712, in start
    self.io_loop.start()
  File "c:\Users\mbgnwmab\A

KeyboardInterrupt: 
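
The recording loop above tests for `PING` only at the start of each received chunk, while a single `recv` call can return several IRC lines at once, or only part of one. The sketch below buffers the incoming text and handles it line by line. It assumes that lines end with `\r\n`, as in the IRC protocol, and that the server accepts a `PONG` echoing the `PING` payload; `split_irc_lines` and `handle_lines` are hypothetical helper names.

```python
def split_irc_lines(buffer, chunk):
    """Append a decoded chunk to the buffer; return (complete_lines, remainder)."""
    buffer += chunk
    # Everything before the last "\r\n" is complete; the tail may be a partial line
    *lines, remainder = buffer.split("\r\n")
    return lines, remainder

def handle_lines(lines, send):
    """Answer PING lines through `send`; return the other non-empty lines."""
    kept = []
    for line in lines:
        if line.startswith("PING"):
            # "PING :tmi.twitch.tv" is answered with "PONG :tmi.twitch.tv"
            send(("PONG" + line[4:] + "\r\n").encode("utf-8"))
        elif line:
            kept.append(line)
    return kept
```

Inside the loop, this could be used as `lines, buffer = split_irc_lines(buffer, sock.recv(2048).decode('utf-8'))` with `buffer = ""` set before the loop, followed by logging each line returned by `handle_lines(lines, sock.send)`.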

In [11]:
import socket
from emoji import demojize
import logging
import time

def establish_socket_connection(server, port, token, nickname, channel):
    sock = socket.socket()
    sock.connect((server, port))
    sock.send(f"PASS {token}\n".encode('utf-8'))
    sock.send(f"NICK {nickname}\n".encode('utf-8'))
    sock.send(f"JOIN {channel}\n".encode('utf-8'))
    return sock

def record_chat(sock, duration_minutes, log_file):
    start_time = time.time()
    duration_seconds = duration_minutes * 60
    while time.time() - start_time < duration_seconds:
        resp = sock.recv(2048).decode('utf-8')
        if resp.startswith('PING'):
            sock.send("PONG\n".encode('utf-8'))
        elif len(resp) > 0:
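            log_file.write(demojize(resp))
    # Flush rather than close: this stream belongs to the logging handler, and closing it
    # makes later logging calls fail with "I/O operation on closed file"
    log_file.flush()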

server = 'irc.chat.twitch.tv'
port = 6667
nickname = 'eaquearcana'
token = "oauth:jmzvwalm7hajyuzp423s0cr18ssbqt"
channel = '#casinodaddy'

sock = establish_socket_connection(server, port, token, nickname, channel)

logging.basicConfig(level=logging.DEBUG,
                        format='%(asctime)s — %(message)s',
                        datefmt='%Y-%m-%d_%H:%M:%S',
                        handlers=[logging.FileHandler('chat.log', encoding='utf-8')])

record_chat(sock, 1, logging.getLogger().handlers[0].stream)

sock.close()

In [16]:
# Define the functions
import socket
from emoji import demojize
import logging
import time

def setup_logger(log_file):
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s — %(message)s',
        datefmt='%Y-%m-%d_%H:%M:%S',
        handlers=[logging.FileHandler(log_file, encoding='utf-8')]
    )

def record_chat(server, port, nickname, token, channel, log_file, duration_minutes):
    sock = socket.socket()
    sock.connect((server, port))

    sock.send(f"PASS {token}\n".encode('utf-8'))
    sock.send(f"NICK {nickname}\n".encode('utf-8'))
    sock.send(f"JOIN {channel}\n".encode('utf-8'))

    start_time = time.time()
    duration_seconds = duration_minutes * 60

    while time.time() - start_time < duration_seconds:
        resp = sock.recv(2048).decode('utf-8')

        if resp.startswith('PING'):
            sock.send("PONG\n".encode('utf-8'))
        elif len(resp) > 0:
            logging.info(demojize(resp))

    sock.close()

# Configure the logger
setup_logger('chat.log')

# Example usage
server = 'irc.chat.twitch.tv'
port = 6667
nickname = 'eaquearcana'
token = "oauth:jmzvwalm7hajyuzp423s0cr18ssbqt"
channel = '#casinodaddy'
log_file = 'chat.log'
duration_minutes = 1  # Adjust as needed

# Record chat
record_chat(server, port, nickname, token, channel, log_file, duration_minutes)


--- Logging error ---
Traceback (most recent call last):
  File "c:\Users\mbgnwmab\AppData\Local\anaconda3\envs\s2ds_ML\lib\logging\__init__.py", line 1036, in emit
    stream.write(msg)
ValueError: I/O operation on closed file.
Call stack:
  File "c:\Users\mbgnwmab\AppData\Local\anaconda3\envs\s2ds_ML\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\Users\mbgnwmab\AppData\Local\anaconda3\envs\s2ds_ML\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\Users\mbgnwmab\AppData\Local\anaconda3\envs\s2ds_ML\lib\site-packages\ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "c:\Users\mbgnwmab\AppData\Local\anaconda3\envs\s2ds_ML\lib\site-packages\traitlets\config\application.py", line 1043, in launch_instance
    app.start()
  File "c:\Users\mbgnwmab\AppData\Local\anaconda3\envs\s2ds_ML\lib\site-packages\ipykernel\kernelapp.py", line 712, in start
    self.io_loop.start()
  File "c:\Users\mbgnwmab\A

KeyboardInterrupt: 
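
The `I/O operation on closed file` error above typically appears when the file stream behind a logging handler has been closed while the handler is still registered. Note also that `logging.basicConfig` does nothing when the root logger already has handlers, so calling it again in a later cell does not switch to a new file. One possible way around both issues is a dedicated logger with its own file handler, sketched below; `make_chat_logger` and the logger name `twitch_chat` are assumptions made for illustration.

```python
import logging

def make_chat_logger(log_path, name="twitch_chat"):
    """Return (logger, handler) writing timestamped lines to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep chat lines out of the root logger
    # Drop handlers left over from a previous run of the cell
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    handler = logging.FileHandler(log_path, encoding="utf-8")
    # Same "timestamp — message" layout as the notebook, so the parsing cell still applies
    handler.setFormatter(
        logging.Formatter("%(asctime)s — %(message)s", datefmt="%Y-%m-%d_%H:%M:%S")
    )
    logger.addHandler(handler)
    return logger, handler
```

The recording function would then call `logger.info(...)` instead of `logging.info(...)`, and the caller would run `handler.close()` followed by `logger.removeHandler(handler)` once recording is finished.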

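It is now time to process the chat log. Each chat line in the log contains the user name, the block "PRIVMSG", the channel name, and then the message. The date and time appear only at the start of each log entry, so when one entry holds several chat lines, only its first line carries a timestamp.

The cell below extracts these fields with basic string operations.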



In [None]:

import pandas as pd

# Create lists to store the extracted information
date_times = []
users = []
messages = []

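# The log was written as UTF-8, so read it back the same way (the "—" separator depends on it)
with open('./chat.log', 'r', encoding='utf-8') as file_chat:
    # Initialize a counter
    count = 1

    # Iterate over each line in the file
    for line in file_chat:
        # Strip any leading or trailing whitespace
        line = line.strip()

        # Skip empty lines
        if line:
            # Skip lines that mention our own account (welcome messages, JOIN, NAMES list)
            if "eaquearcana" not in line:
                if "PRIVMSG" in line:
                    # Split on the first "PRIVMSG" only, so a message containing that word stays whole
                    blocks = line.split("PRIVMSG", 1)

                    if "—" in blocks[0]:
                        # Line with a timestamp: "<date_time> — :<nick>!<nick>@<nick>.tmi.twitch.tv"
                        date_time = blocks[0].split("—")[0].strip()
                        user_name = blocks[0].split("—")[1].strip()
                        user_name = user_name.split("@")[1].split(".")[0].strip()
                    else:
                        # Line without a timestamp
                        date_time = None
                        user_name = blocks[0].strip()
                        user_name = user_name.split("@")[1].split(".")[0].strip()

                    # Split on the first channel marker only, to keep the whole message
                    message = blocks[1].split("#casinodaddy :", 1)[1].strip()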

                    # Append the extracted information to the lists
                    date_times.append(date_time)
                    users.append(user_name)
                    messages.append(message)

                    print(f"Successfully reached line {count} !")
                    print(line)

                    count += 1

In [None]:

import re

# Create a DataFrame from the lists
df = pd.DataFrame({'Date and Time': date_times, 'User Nickname': users, 'Message': messages})
# Fill missing timestamps from the neighbouring rows (forward first, then backward)
df['Date and Time'] = df['Date and Time'].ffill().bfill()

# Flag the messages in the "Message" column that contain any of these substrings
search_strings = ["http://", "https://", "www.", "bit.ly", "tinyurl."]
# re.escape keeps the dots literal, since str.contains treats its argument as a regex
flag_pattern = '|'.join(re.escape(s) for s in search_strings)
df['contains_weblinks'] = np.where(df['Message'].str.contains(flag_pattern), 'yes', 'no')

# Pattern for the full link: a scheme or "www." prefix, or a bare shortener domain followed by a path
pattern = r'(?:https?://|www\.)\S+|(?:bit\.ly|tinyurl\.[a-zA-Z]+)/\S+'


# if we expect two or more URLs per chat line:
# df['weblinks'] = df['Message'].apply(lambda x: re.findall(pattern, x))

# if we expect one URL only:
df['weblinks'] = df['Message'].apply(lambda x: next(iter(re.findall(pattern, x)), None))
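
Since nicknames should not be retained, the data handed to the Enforcement Team could be reduced to timestamps and links only. The sketch below works on the plain lists built earlier rather than on the DataFrame. `URL_PATTERN` and `extract_links` are hypothetical names, the pattern is a simplified assumption of what counts as a link, and the sample data is made up for illustration.

```python
import re

# Simplified link pattern: scheme or "www." prefix, or a known shortener with a path
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+|(?:bit\.ly|tinyurl\.com)/\S+")

def extract_links(date_times, messages):
    """Return (timestamp, link) pairs; nicknames are never part of the output."""
    links = []
    for timestamp, message in zip(date_times, messages):
        for link in URL_PATTERN.findall(message):
            # Trim punctuation that often trails a link in chat text
            links.append((timestamp, link.rstrip(".,;:!?)")))
    return links

# Made-up sample data
sample_times = ["2023-08-01_20:15:02", "2023-08-01_20:15:09"]
sample_msgs = ["check www.example.com now!", "700x?"]
print(extract_links(sample_times, sample_msgs))
# -> [('2023-08-01_20:15:02', 'www.example.com')]
```

The result can be turned into a table with `pd.DataFrame(extract_links(date_times, messages), columns=['Date and Time', 'weblink'])`, which leaves the nickname list out of the exported data.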

