# Data Science - Poster Workshop
# How can we analyze data in order to discover new insights that may be hidden in it?

We are using the Yeast dataset from:     

    Kenta Nakai
    Institute of Molecular and Cellular Biology
    Osaka University
    1-3 Yamada-oka, Suita 565 Japan
    nakai@imcb.osaka-u.ac.jp
    http://www.imcb.osaka-u.ac.jp/nakai/psort.html
    Donor: Paul Horton (paulh@cs.berkeley.edu)
    Date:  September, 1996
    See also: ecoli database

The dataset is available in the [UCI Archive](https://archive.ics.uci.edu/dataset/110/yeast).

In this data science poster workshop we compare different ways of classifying the data.

### Here are the poster workshop specifications

- The poster workshop (PW) is organised at the end of the semester and is based on the presentation of a project.
- A student can participate individually or as a member of a group of up to 3 students.
- A project is a piece of data science work carried out on a provided dataset. Possibly 2 or 3 datasets will be presented, and each group chooses 1 of them.
- The project work consists of demonstrating an aspect of the course using the selected dataset. The course terminology must be used.
- The PW is divided into two parts:

    - Part 1 is an "appetiser" presentation in which the most important aspects of the project are presented (5 minutes). The aim is to promote one's own work and convince the examiners to come and get more information.
    - In Part 2, each student/group hangs a poster on the wall and explains the work to the examiners.

In [10]:
import pandas as pd
import seaborn as sns

In [11]:
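# Columns are separated by runs of whitespace; the raw string avoids an invalid-escape warning for '\s'
df = pd.read_csv("yeast/yeast.data", header=None, sep=r'\s+', engine='python')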

df.columns = ["Sequence_Name", "mcg", "gvh", "alm", "mit", "erl", "pox", "vac", "nuc", "localization_site"]

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1484 entries, 0 to 1483
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Sequence_Name      1484 non-null   object 
 1   mcg                1484 non-null   float64
 2   gvh                1484 non-null   float64
 3   alm                1484 non-null   float64
 4   mit                1484 non-null   float64
 5   erl                1484 non-null   float64
 6   pox                1484 non-null   float64
 7   vac                1484 non-null   float64
 8   nuc                1484 non-null   float64
 9   localization_site  1484 non-null   object 
dtypes: float64(8), object(2)
memory usage: 116.1+ KB


The goal of this poster is to predict the target "localization_site" from the 8 given features.

We will perform classification using several different models and evaluate the performance of each model based on its accuracy and precision.

Finally, we will look at the overall insights we can draw from this analysis.
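Since accuracy and precision are the two metrics used for the comparison, it may help to see how they can be computed for a multi-class target. The block below is a minimal sketch in plain Python, not the evaluation code of this project. The function names are hypothetical, the labels are made up for illustration, and averaging precision with equal weight per class ("macro" averaging, with 0 for classes that are never predicted) is an assumed convention.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_precision(y_true, y_pred):
    """For each class: correct predictions of that class / all predictions of that class.

    Classes that are never predicted get a precision of 0.0 (assumed convention).
    """
    predicted = Counter(y_pred)
    true_positive = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    labels = sorted(set(y_true) | set(y_pred))
    return {
        label: true_positive[label] / predicted[label] if predicted[label] else 0.0
        for label in labels
    }


def macro_precision(y_true, y_pred):
    """Unweighted mean of the per-class precision values."""
    scores = per_class_precision(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Made-up labels for illustration only; these are not model results.
y_true = ["CYT", "CYT", "NUC", "NUC", "MIT", "ERL"]
y_pred = ["CYT", "NUC", "NUC", "NUC", "MIT", "CYT"]

acc = accuracy(y_true, y_pred)
prec = per_class_precision(y_true, y_pred)
macro = macro_precision(y_true, y_pred)

print(f"accuracy: {acc:.3f}")
print(f"per-class precision: {prec}")
print(f"macro precision: {macro:.3f}")
```

In this toy example the rare class ERL is never predicted, so it lowers the macro precision noticeably while costing only a single sample in accuracy. That is the reason for reporting both metrics side by side.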

In [12]:
# Number of occurrences of each localization site
print(df['localization_site'].value_counts())

# Convert localization_site to a categorical dtype
df["localization_site"] = df["localization_site"].astype('category')

print(df["localization_site"].cat.categories)

localization_site
CYT    463
NUC    429
MIT    244
ME3    163
ME2     51
ME1     44
EXC     35
VAC     30
POX     20
ERL      5
Name: count, dtype: int64
Index(['CYT', 'ERL', 'EXC', 'ME1', 'ME2', 'ME3', 'MIT', 'NUC', 'POX', 'VAC'], dtype='object')
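
The counts above show that the classes are strongly imbalanced, which matters when accuracy is used to judge a model. As a rough reference point, the sketch below computes the share of each class and the accuracy of a trivial classifier that always predicts the most frequent class. The counts are copied from the output above; the variable names are hypothetical, and using the majority class as a baseline is a suggestion rather than part of the original analysis.

```python
# Class counts copied from the value_counts() output above.
class_counts = {
    "CYT": 463, "NUC": 429, "MIT": 244, "ME3": 163, "ME2": 51,
    "ME1": 44, "EXC": 35, "VAC": 30, "POX": 20, "ERL": 5,
}

total = sum(class_counts.values())

# Share of each class in the whole dataset
proportions = {label: n / total for label, n in class_counts.items()}

# A classifier that always predicts the most frequent class
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / total

for label, share in proportions.items():
    print(f"{label}: {share:.1%}")
print(f"Always predicting {majority_class} gives accuracy {baseline_accuracy:.3f}")
```

Any model evaluated later should reach an accuracy clearly above this baseline before it can be considered useful.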
