## **Exploratory Data Analysis**

Before introducing any model, we conducted an exploratory data analysis to clean and better understand our dataset.
 
The dataset used in this project was obtained from Kaggle and is available at the following link:     
[Multilingual Customer Support Tickets](https://www.kaggle.com/datasets/tobiasbueck/multilingual-customer-support-tickets?select=dataset-tickets-multi-lang3-4k.csv)   
 
It contains a collection of real-world customer support tickets written in English or German, along with metadata such as ticket subject, body, language, assigned queue, priority, and various tags.
 
We chose this dataset because it is large (the file used here has 20,000 records) and rich in metadata, which lets us predict the queue, that is, the team or department that should handle each support ticket, from the ticket's details.
 
This classification task can help automate the assignment process in customer support systems, ensuring that each ticket is directed to the right team for a timely and effective response.

### Import Libraries

In [2]:
import pandas as pd

### Dataset

In [None]:
# Read the CSV file into a DataFrame
tickets_df = pd.read_csv('../data/dataset-tickets-multi-lang-4-20k.csv')
tickets_df.head()

Unnamed: 0,subject,body,answer,type,queue,priority,language,tag_1,tag_2,tag_3,tag_4,tag_5,tag_6,tag_7,tag_8
0,Unvorhergesehener Absturz der Datenanalyse-Pla...,Die Datenanalyse-Plattform brach unerwartet ab...,Ich werde Ihnen bei der Lösung des Problems he...,Incident,General Inquiry,low,de,Crash,Technical,Bug,Hardware,Resolution,Outage,Documentation,
1,Customer Support Inquiry,Seeking information on digital strategies that...,We offer a variety of digital strategies and s...,Request,Customer Service,medium,en,Feedback,Sales,IT,Tech Support,,,,
2,Data Analytics for Investment,I am contacting you to request information on ...,I am here to assist you with data analytics to...,Request,Customer Service,medium,en,Technical,Product,Guidance,Documentation,Performance,Feature,,
3,Krankenhaus-Dienstleistung-Problem,Ein Medien-Daten-Sperrverhalten trat aufgrund ...,Zurück zur E-Mail-Beschwerde über den Sperrver...,Incident,Customer Service,high,de,Security,Breach,Login,Maintenance,Incident,Resolution,Feedback,
4,Security,"Dear Customer Support, I am reaching out to in...","Dear [name], we take the security of medical d...",Request,Customer Service,medium,en,Security,Customer,Compliance,Breach,Documentation,Guidance,,


The dataset contains the following variables:
- `subject`: Subject of the customer's email.
- `body`: Body of the customer's email.
- `answer`: The response provided by the helpdesk agent.
- `type`: The type of ticket as picked by the agent (Incident, Request, Problem, Change).
- `queue`: Specifies the department to which the email ticket is routed (General Inquiry, Customer Service, Technical Support, IT Support, Product Support, Billing and Payments, Service Outages and Maintenance, Human Resources, Returns and Exchanges, Sales and Pre-Sales). 
- `priority`: Indicates the urgency and importance of the issue (low, medium, high).
- `language`: Indicates the language in which the email is written (de, en).
- `tag`: Tags/categories assigned to the ticket to further classify and identify common issues or topics, split into eight columns in the dataset, `tag_1` to `tag_8` (examples: "Crash", "Technical", "Bug").
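
The value lists above can be checked directly against the data. The sketch below is illustrative only: `sample_df` is a small hand-made stand-in for `tickets_df`, and `categorical_summary` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for tickets_df (hypothetical rows, same column names as the dataset)
sample_df = pd.DataFrame({
    "type": ["Incident", "Request", "Request", "Problem"],
    "queue": ["General Inquiry", "Customer Service", "Customer Service", "Product Support"],
    "priority": ["low", "medium", "medium", "high"],
    "language": ["de", "en", "en", "en"],
})


def categorical_summary(df, columns):
    """Return the sorted distinct (non-null) values of each listed column."""
    return {col: sorted(df[col].dropna().unique().tolist()) for col in columns}


summary = categorical_summary(sample_df, ["type", "queue", "priority", "language"])
for col, values in summary.items():
    print(f"{col}: {values}")
```

Calling the same helper on `tickets_df` would list every distinct `type`, `queue`, `priority`, and `language` value that actually occurs in the file.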

##### Check for duplicates

In [4]:
# Count rows that are exact duplicates of an earlier row (all columns equal)
tickets_df.duplicated().sum()
# The result is 0, so there are no fully duplicated rows

np.int64(0)
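
Note that `duplicated()` without arguments only flags rows that match in every column. A stricter check, which we did not run in this notebook, would compare the text fields alone. The sketch below shows the difference on a toy frame; `toy_df` and its rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows: the first two tickets share the same text but differ in queue
toy_df = pd.DataFrame({
    "subject": ["Login issue", "Login issue", "Billing question"],
    "body": ["Cannot log in.", "Cannot log in.", "Invoice is wrong."],
    "queue": ["IT Support", "Technical Support", "Billing and Payments"],
})

# Rows identical in every column
full_row_duplicates = int(toy_df.duplicated().sum())

# Rows that repeat an earlier subject/body pair, regardless of other columns
text_duplicates = int(toy_df.duplicated(subset=["subject", "body"]).sum())

print(full_row_duplicates, text_duplicates)
```

Tickets with identical text but different queues would be contradictory labels for a classifier, which is why the `subset` variant can be worth a look.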

##### Check datatypes and missing values

In [5]:
tickets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   subject   18539 non-null  object
 1   body      19998 non-null  object
 2   answer    19996 non-null  object
 3   type      20000 non-null  object
 4   queue     20000 non-null  object
 5   priority  20000 non-null  object
 6   language  20000 non-null  object
 7   tag_1     20000 non-null  object
 8   tag_2     19954 non-null  object
 9   tag_3     19905 non-null  object
 10  tag_4     18461 non-null  object
 11  tag_5     13091 non-null  object
 12  tag_6     7351 non-null   object
 13  tag_7     3928 non-null   object
 14  tag_8     1907 non-null   object
dtypes: object(15)
memory usage: 2.3+ MB
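
From the non-null counts above, `subject` is missing in 1,461 of 20,000 rows (about 7.3%), `body` in 2, and `answer` in 4, while the tag columns become progressively sparser from `tag_2` to `tag_8`. A compact way to get such a report is sketched below on a toy frame; `toy_df` is a made-up stand-in for `tickets_df`.

```python
import pandas as pd

# Hypothetical frame with a few missing values
toy_df = pd.DataFrame({
    "subject": ["A", None, "C", "D"],
    "body": ["x", "y", None, "w"],
    "tag_1": ["Bug", "Sales", "IT", "Crash"],
    "tag_2": ["Crash", None, None, None],
})

# Count missing values per column and express them as a percentage of rows
missing = toy_df.isnull().sum()
missing_pct = (missing / len(toy_df) * 100).round(1)

# Keep only columns with at least one missing value, most affected first
report = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
report = report[report["missing"] > 0].sort_values("missing", ascending=False)
print(report)
```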


In [7]:
# Show the rows where the ticket body is missing
tickets_df[tickets_df['body'].isnull()]

Unnamed: 0,subject,body,answer,type,queue,priority,language,tag_1,tag_2,tag_3,tag_4,tag_5,tag_6,tag_7,tag_8
17687,Moodle integration broke unexpectedly due to a...,,Could you please offer more specific details?,Problem,Product Support,medium,en,Technical,Bug,Integration,Crash,Documentation,Maintenance,,
17959,Mehrmale Integrationsvorgänge gingen über Nach...,,Ich werde mich gerne um den Fehler kümmern und...,Incident,Product Support,medium,de,Technical,Bug,API,Authentication,Outage,Resolution,Support,Crash
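
Both rows with a missing `body` still have a `subject`, so they carry some usable text. One possible way to handle this, shown here as an assumption rather than the approach taken later in the notebook, is to fill missing text with empty strings and merge `subject` and `body` into a single field. The frame `toy_df` and the `text` column name are hypothetical.

```python
import pandas as pd

# Hypothetical rows covering the cases seen in the data:
# missing body, missing subject, and both present
toy_df = pd.DataFrame({
    "subject": ["Moodle integration broke", None, "Invoice"],
    "body": [None, "The app crashes on start.", "Please resend the invoice."],
    "queue": ["Product Support", "Technical Support", "Billing and Payments"],
})

# Replace missing text with empty strings, then join subject and body
text = (toy_df["subject"].fillna("") + " " + toy_df["body"].fillna("")).str.strip()
cleaned_df = toy_df.assign(text=text)

# Drop only the rows where neither field had any content
cleaned_df = cleaned_df[cleaned_df["text"] != ""].reset_index(drop=True)
print(cleaned_df[["text", "queue"]])
```

With this choice no ticket is discarded just because one of the two text fields is empty.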


##### Analyze class imbalance