# 3. Working with unstructured text data

It is relatively easy to work with text data that is already stored in a structured format such as a CSV file. In practice, however, texts often come in unstructured formats such as .txt, .docx, or .pdf. These need to be treated differently: you have to create the structure yourself.

### 3.1. Word documents

In [142]:
# locate the interview transcripts of the conservative (Cons) and labour (Lab) MPs
import glob
cons_interviews_files = glob.glob(r'C:\Users\pauld\OneDrive\PhD\Teaching\2023WiSe-Qualitative methods_MA\Cons_interview_german*.docx')
lab_interviews_files = glob.glob(r'C:\Users\pauld\OneDrive\PhD\Teaching\2023WiSe-Qualitative methods_MA\Lab_interview_german*.docx')


In [143]:
cons_interviews_files

['C:\\Users\\pauld\\OneDrive\\PhD\\Teaching\\2023WiSe-Qualitative methods_MA\\Cons_interview_german_1.docx',
 'C:\\Users\\pauld\\OneDrive\\PhD\\Teaching\\2023WiSe-Qualitative methods_MA\\Cons_interview_german_2.docx',
 'C:\\Users\\pauld\\OneDrive\\PhD\\Teaching\\2023WiSe-Qualitative methods_MA\\Cons_interview_german_3.docx']
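`glob.glob` returns its matches in arbitrary order, and the cells in this section number the transcripts by their position in the list. A minimal sketch of one way to make that order reproducible is to sort the file names before processing them. The helper `natural_key` and the placeholder files are assumptions made for illustration; they are not part of the original notebook.

```python
import glob
import os
import re
import tempfile

def natural_key(path):
    """Split a file name into text and integer chunks so that _2 sorts before _10."""
    name = os.path.basename(path)
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r'(\d+)', name)]

# Demonstration with empty placeholder files in a temporary folder (hypothetical names).
with tempfile.TemporaryDirectory() as folder:
    for i in (10, 2, 1):
        open(os.path.join(folder, f'Cons_interview_german_{i}.docx'), 'w').close()

    files = sorted(glob.glob(os.path.join(folder, 'Cons_interview_german*.docx')),
                   key=natural_key)
    sorted_names = [os.path.basename(f) for f in files]

print(sorted_names)
```

With only a handful of files a plain `sorted(...)` would do; the numeric key only matters once file numbers reach two digits.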

In [80]:
import docx2txt

cons_interviews = []

for filename in cons_interviews_files:
    text = docx2txt.process(filename)
    cons_interviews.append(text)

In [81]:
cons_interviews[0]

'Interviewer: Guten Tag, meine Damen und Herren. Heute haben wir die Ehre, einen prominenten Gast zu begrüßen, einen konservativen Abgeordneten des österreichischen Parlaments. Herzlich willkommen, Herr [Name]. Wir freuen uns, mit Ihnen über aktuelle politische Themen zu sprechen.\n\n\n\nAbgeordneter: Guten Tag, es ist mir eine Freude, hier zu sein.\n\n\n\nInterviewer: Lassen Sie uns direkt einsteigen. Eines der beherrschenden Themen in der politischen Landschaft ist derzeit die Migrationspolitik. Wie stehen Sie zu diesem Thema, insbesondere im Kontext von Österreich?\n\n\n\nAbgeordneter: Die Migrationspolitik ist zweifellos von großer Bedeutung. Österreich hat eine lange Tradition der Offenheit, aber es ist wichtig sicherzustellen, dass wir eine kontrollierte und nachhaltige Einwanderungspolitik haben. Wir müssen die Bedürfnisse unserer Gesellschaft und Wirtschaft berücksichtigen und sicherstellen, dass unsere Grenzen geschützt sind. Dies bedeutet jedoch nicht, dass wir den humanitäre

In [None]:
# what's wrong here if we want to analyze the MPs' language?

In [110]:
# Regular expressions are a small pattern language for finding text; they take some getting used to.
# The patterns below capture only the statements of the Abgeordneter (MP).
import re
pattern1 = r'Abgeordneter:.*' # matches 'Abgeordneter:' and everything after it up to the end of the line
pattern2 = r'MP:.*' # the same for transcripts that label the speaker 'MP'

statement_list_con = []
MP_list_con = []

nr = 1

# for loop to iterate through the list of conservative interview transcripts
for transcript in cons_interviews:
    
    # create an ID for the MP
    MP = 'con_'+str(nr) # IDs look like con_1, con_2, etc.; every statement from a transcript gets that transcript's ID
    
    # use re.findall to extract the Abgeordneter statements
    abgeordneter_statements = re.findall(pattern1, transcript)
    
    
    # in case it didn't find anything for 'Abgeordneter', try 'MP'
    if not abgeordneter_statements:
        abgeordneter_statements = re.findall(pattern2, transcript)
    
    # now we append every statement found, together with the matching MP ID, to the two lists
    for statement in abgeordneter_statements:
        statement_list_con.append(statement)
        MP_list_con.append(MP)
    
    # increase the counter so that the next transcript gets the next MP ID
    nr = nr+1
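One property of the pattern `Abgeordneter:.*` is worth knowing: `.` does not match a line break, so a statement that runs over several lines is cut off at the end of its first line. The sketch below shows this on a short made-up transcript and offers one possible alternative that reads up to the next speaker label. The sample text and the name `turn_pattern` are assumptions for illustration, and this alternative drops the speaker label from the result, unlike the cell above.

```python
import re

# Short made-up transcript in the same "Speaker: text" layout (not real data).
sample = (
    "Interviewer: Wie stehen Sie zur Migrationspolitik?\n\n"
    "Abgeordneter: Das Thema ist wichtig.\n"
    "Wir brauchen klare Regeln.\n\n"
    "Interviewer: Vielen Dank.\n\n"
    "Abgeordneter: Gern geschehen."
)

# '.' does not match a newline, so this pattern stops at the first line break.
first_lines = re.findall(r'Abgeordneter:.*', sample)
print(first_lines)

# Alternative: read lazily from the MP label up to the next speaker label
# at the start of a line, or up to the end of the text.
turn_pattern = re.compile(
    r'^(?:Abgeordneter|MP):\s*(.*?)(?=^\w+:|\Z)',
    flags=re.DOTALL | re.MULTILINE,
)
mp_turns = [turn.strip() for turn in turn_pattern.findall(sample)]
print(mp_turns)
```

The lookahead treats any line that starts with a single word followed by a colon as a new speaker, so a statement containing a line such as "Erstens: ..." would be split too early. For transcripts where each statement sits on one line, the simpler pattern is sufficient.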


In [106]:
# now same for labour
lab_interviews = []

for filename in lab_interviews_files:
    text = docx2txt.process(filename)
    lab_interviews.append(text)

    
# Same regular expression patterns as for the conservative transcripts
import re
pattern1 = r'Abgeordneter:.*' # matches 'Abgeordneter:' and everything after it up to the end of the line
pattern2 = r'MP:.*' # the same for transcripts that label the speaker 'MP'

statement_list_lab = []
MP_list_lab = []
nr = 1
for transcript in lab_interviews:
    
    #create an ID for the MP: 
    MP = 'lab_'+str(nr)
    
    # Use re.findall to extract Abgeordneter statements
    abgeordneter_statements = re.findall(pattern1, transcript)
    
    
    # in case it didn't find anything for 'Abgeordneter', try 'MP'
    if not abgeordneter_statements:
        abgeordneter_statements = re.findall(pattern2, transcript)
    
    for statement in abgeordneter_statements:
        statement_list_lab.append(statement)
        MP_list_lab.append(MP)
    
    nr = nr+1
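The conservative and labour cells repeat the same steps with a different prefix. As a sketch of how the repetition could be avoided, the logic can be wrapped in one function. The function name `extract_mp_statements`, its parameters, and the demo strings are hypothetical and not part of the original notebook.

```python
import re

def extract_mp_statements(transcripts, prefix, labels=('Abgeordneter', 'MP')):
    """Return (MP IDs, statements) for a list of transcript strings.

    Tries each speaker label in turn and keeps the first one that matches,
    mirroring the fallback from 'Abgeordneter' to 'MP' used above.
    """
    mp_ids, statements = [], []
    for nr, transcript in enumerate(transcripts, start=1):
        mp_id = f'{prefix}_{nr}'  # e.g. con_1, con_2, ...
        found = []
        for label in labels:
            found = re.findall(rf'{re.escape(label)}:.*', transcript)
            if found:
                break
        statements.extend(found)
        mp_ids.extend([mp_id] * len(found))
    return mp_ids, statements

# Made-up mini transcripts, one per speaker label.
demo = [
    "Interviewer: Hallo.\nAbgeordneter: Guten Tag.",
    "Interviewer: Hello.\nMP: Good afternoon.\nMP: Thank you.",
]
ids, texts = extract_mp_statements(demo, 'con')
print(list(zip(ids, texts)))
```

Calling it once with `cons_interviews` and `'con'` and once with `lab_interviews` and `'lab'` would produce the same four lists as the two cells above.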

In [115]:
import pandas as pd
# now create a dataframe that is similar to what we used in Notebook 2
df_con = pd.DataFrame(list(zip(MP_list_con,statement_list_con)), columns = ['MP', 'Text'])
df_lab = pd.DataFrame(list(zip(MP_list_lab,statement_list_lab)), columns = ['MP', 'Text'])


df = pd.concat([df_con, df_lab], ignore_index=True) # stacks the two dataframes on top of each other and renumbers the rows

# add a party variable 
party_list = []
for MP in df['MP']: 
    if 'con' in MP: 
        party_list.append('Con')
    else: # watch out: this catches everything else, so it only works with exactly two parties
        party_list.append('Lab')
df['party'] = party_list


#add an empty column for coding
df['your_code'] = ''
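The if/else above assigns 'Lab' to every ID that does not contain 'con'. A small sketch of an alternative that scales to more parties looks the party up from the ID prefix and fails loudly on unknown prefixes. The names `PARTY_BY_PREFIX` and `party_from_id` are hypothetical.

```python
# Mapping from ID prefix to party label; extend it when more parties are added.
PARTY_BY_PREFIX = {'con': 'Con', 'lab': 'Lab'}

def party_from_id(mp_id):
    """Look up the party from the part of the MP ID before the first underscore."""
    prefix = mp_id.split('_', 1)[0]
    try:
        return PARTY_BY_PREFIX[prefix]
    except KeyError:
        raise ValueError(f'Unknown party prefix in MP ID: {mp_id!r}') from None

party_list = [party_from_id(mp) for mp in ['con_1', 'con_2', 'lab_1']]
print(party_list)
```

With the dataframe from this section, the same function could be applied as `df['party'] = df['MP'].map(party_from_id)`.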

In [116]:
df

,MP,Text,party,your_code
0,con_1,"Abgeordneter: Guten Tag, es ist mir eine Freud...",Con,
1,con_1,Abgeordneter: Die Migrationspolitik ist zweife...,Con,
2,con_1,Abgeordneter: Die Wirtschaft ist das Rückgrat ...,Con,
3,con_1,Abgeordneter: Österreich ist ein stolzes Mitgl...,Con,
4,con_1,Abgeordneter: Der Klimawandel ist zweifellos e...,Con,
5,con_1,Abgeordneter: Ich möchte die Bürgerinnen und B...,Con,
6,con_1,Abgeordneter: Vielen Dank. Es war mir eine Fre...,Con,
7,con_2,Abgeordneter: Vielen Dank für die Einladung. D...,Con,
8,con_2,Abgeordneter: Einwanderung ist eine komplexe A...,Con,
9,con_2,Abgeordneter: Der Platz Österreichs in der Eur...,Con,


In [None]:
## ------- insert code from Notebook 2 -----------


## run through the same steps 

### 3.2. Opening PDFs

In [122]:
from tika import parser # pip install tika

raw = parser.from_file(r'C:\Users\pauld\OneDrive\PhD\Teaching\2023WiSe-Qualitative methods_MA\Using Python to assist in qualitative data analysis.pdf')
raw['content']

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWorkshop preparation: Using automated methods to assist in \n\nqualitative data analysis \n\nComputational methods can be a powerful aid to qualitative data analysis. In this hands-on workshop, \n\nwe will try analyzing text using Python, one of the simplest and most commonly used programming \n\nlanguages. Please follow the instructions below to make this a successful session.  \n\n1. To bring to class \n\n- Please bring your own text data to analyze in class. The data should either be saved as .doc, .pdf, .txt, \n\nor in a .csv/.xlsx file. You can bring multiple separate files or one large text file. Of course, this could \n\nbe interview transcripts or observations that you collected for this class. In case you cannot bring any \n\ndata yourself, I will provide you with an alternative text to work with.  \n\n- You will need a charged(!) laptop in class – also bring a charger, working with data can be ba

In [124]:
text = raw['content'].replace('\n', '')
text

'Workshop preparation: Using automated methods to assist in qualitative data analysis Computational methods can be a powerful aid to qualitative data analysis. In this hands-on workshop, we will try analyzing text using Python, one of the simplest and most commonly used programming languages. Please follow the instructions below to make this a successful session.  1. To bring to class - Please bring your own text data to analyze in class. The data should either be saved as .doc, .pdf, .txt, or in a .csv/.xlsx file. You can bring multiple separate files or one large text file. Of course, this could be interview transcripts or observations that you collected for this class. In case you cannot bring any data yourself, I will provide you with an alternative text to work with.  - You will need a charged(!) laptop in class – also bring a charger, working with data can be battery-intensive. Please install the necessary software ahead of class: We will run Python using Jupyter Notebook, which 
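Removing `'\n'` outright works here because the extracted lines happen to end with a space. Where a line break directly follows a word, the two neighbouring words would be glued together. A slightly more defensive sketch collapses every run of whitespace into a single space instead. The function name `normalise_whitespace` and the shortened sample string are assumptions, as is the guard for missing content (for example, a scanned PDF without a text layer).

```python
import re

def normalise_whitespace(content):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    if content is None:  # assumption: extraction may yield no text at all
        return ''
    return re.sub(r'\s+', ' ', content).strip()

# Shortened sample in the style of the extracted PDF text above.
raw_content = (
    '\n\n\nWorkshop preparation: Using automated methods to assist in \n\n'
    'qualitative data analysis \n\nlanguages.\nPlease follow'
)

glued = raw_content.replace('\n', '')       # 'languages.Please' loses its space
cleaned = normalise_whitespace(raw_content)  # words stay separated
print(cleaned)
```

Both approaches also erase paragraph breaks, so they fit best when the later analysis does not depend on the layout of the document.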

### 3.3. Opening .txt

In [None]:
text = open(r'C:\Users\pauld\OneDrive\PhD\Teaching\2023WiSe-Qualitative methods_MA\Syllabus_M2_updated.txt','r')


In [137]:
text.read()

'\ufeffM2: Qualitative methods seminar\nDepartment of Political Science\nUniversity of Vienna\nWinter Semester 2023/24\n8 ECTS\n\nWednesdays 4.10.2023- 31.1.2024 \n15 pm – 18.15 pm \nLecture hall 1 (H1), NIG 2nd floor\n\nMSc Paul Dunshirn (paul.dunshirn@univie.ac.at)\nProf. Hendrik Wagenaar (hendrik.wagenaar@gmail.com)\n\nOffice hours: upon agreement\n\nSyllabus\nContent: \nThis seminar is an advanced introduction to qualitative research with a topical focus on environmental politics. Students will learn how to study various sites of environmental politics, such as local resource management, science-policy interactions, and civil society engagement using qualitative methods.\n\nBesides learning how to conduct qualitative research via practical exercises, students will learn to reflect on their own positioning at the intersection of politics and research. One session addresses ways of combining qualitative research with quantitative methods (mixed methods). \n\nThis seminar benefits fro
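The output starts with `'\ufeff'`, a byte order mark that some editors write at the beginning of UTF-8 files. A minimal sketch of one way to avoid it is to open the file with the `utf-8-sig` encoding, inside a `with` block so that the file is closed again afterwards. The temporary file below merely stands in for the syllabus file; its name and contents are made up.

```python
import os
import tempfile

# Create a small UTF-8 file with a byte order mark (hypothetical stand-in file).
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'example.txt')
    with open(path, 'w', encoding='utf-8-sig') as f:
        f.write('M2: Qualitative methods seminar\nDepartment of Political Science\n')

    # Plain UTF-8 keeps the mark as the first character of the text.
    with open(path, 'r', encoding='utf-8') as f:
        with_bom = f.read()

    # 'utf-8-sig' strips the mark if it is present.
    with open(path, 'r', encoding='utf-8-sig') as f:
        without_bom = f.read()

print(repr(with_bom[:4]))
print(repr(without_bom[:4]))
```

Naming the encoding explicitly also avoids depending on the operating system's default, which is not always UTF-8.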