# Data Science - Poster Workshop
# How can we analyze data to discover new insights that may be hidden in it?

We are using the Yeast dataset from:     

    Kenta Nakai
    Institute of Molecular and Cellular Biology
    Osaka University
    1-3 Yamada-oka, Suita 565 Japan
    nakai@imcb.osaka-u.ac.jp
    http://www.imcb.osaka-u.ac.jp/nakai/psort.html
    Donor: Paul Horton (paulh@cs.berkeley.edu)
    Date:  September, 1996
    See also: ecoli database

The dataset is available from the [UCI Archive](https://archive.ics.uci.edu/dataset/110/yeast).

In this data science poster workshop, we explore different ways of classifying the data.
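To make "classifying" concrete, here is a minimal 1-nearest-neighbour sketch. It is only an illustration, not necessarily one of the methods used in this workshop. The five rows are copied from the DataFrame preview printed further down in this notebook, and only the `mcg`, `gvh` and `alm` columns are used because the preview elides the others. The names `TRAIN` and `predict_1nn` are hypothetical.

```python
import math

# Five rows copied from the printed DataFrame preview (assumption: enough
# for illustration only; the real dataset has 1,484 rows and 8 features).
# Each entry is ((mcg, gvh, alm), localization_site).
TRAIN = [
    ((0.58, 0.61, 0.47), "MIT"),  # ADT1_YEAST
    ((0.43, 0.67, 0.48), "MIT"),  # ADT2_YEAST
    ((0.64, 0.62, 0.49), "MIT"),  # ADT3_YEAST
    ((0.58, 0.44, 0.57), "NUC"),  # AAR2_YEAST
    ((0.42, 0.44, 0.48), "MIT"),  # AATM_YEAST
]


def predict_1nn(features, train=TRAIN):
    """Return the label of the training row closest to `features`.

    Closeness is measured with the Euclidean distance over the feature tuple.
    """
    _, label = min(train, key=lambda row: math.dist(row[0], features))
    return label


# Hypothetical query point, close to the ADT1_YEAST row.
print(predict_1nn((0.60, 0.60, 0.48)))
```

With so few rows the prediction says little about the real data; the point is only the shape of the task: numeric features in, one localization site out.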

### Poster workshop specifications

- The poster workshop (PW) is organised at the end of the semester and is based on the presentation of a project.
- A student can take part in the project individually or as a member of a group of up to 3 students.
- A project is a piece of data science work carried out on a provided dataset. Possibly, 2 or 3 datasets will be offered and each group can choose 1 of them.
- The project work consists of demonstrating an aspect of the course using the selected dataset. The course terminology must be used.
- The PW is divided into two parts:

    - Part 1 is an "appetiser" presentation (5 minutes) in which the most important aspects of the project are presented. The aim is to promote your own work and convince the examiners to come and get more information.
    - In Part 2, each student/group hangs a poster on the wall and explains the work to the examiners.

In [72]:
import polars as pl
import seaborn as sns

In [71]:
# Read the dataset
df = pl.read_csv("yeast/yeast.data", separator=' ', truncate_ragged_lines=True, has_header=False)

# Splitting on a single space leaves filler columns at the odd indices; drop them
df = df.drop(df.columns[1::2])

# Rename the columns
df.columns = ["Sequence_Name", "mcg", "gvh", "alm", "mit", "erl", "pox", "vac", "nuc", "localization_site"]

print(df)

shape: (1_484, 10)
┌───────────────┬──────┬──────┬──────┬───┬─────┬──────┬──────┬───────────────────┐
│ Sequence_Name ┆ mcg  ┆ gvh  ┆ alm  ┆ … ┆ pox ┆ vac  ┆ nuc  ┆ localization_site │
│ ---           ┆ ---  ┆ ---  ┆ ---  ┆   ┆ --- ┆ ---  ┆ ---  ┆ ---               │
│ str           ┆ f64  ┆ f64  ┆ f64  ┆   ┆ f64 ┆ f64  ┆ f64  ┆ str               │
╞═══════════════╪══════╪══════╪══════╪═══╪═════╪══════╪══════╪═══════════════════╡
│ ADT1_YEAST    ┆ 0.58 ┆ 0.61 ┆ 0.47 ┆ … ┆ 0.0 ┆ 0.48 ┆ 0.22 ┆ MIT               │
│ ADT2_YEAST    ┆ 0.43 ┆ 0.67 ┆ 0.48 ┆ … ┆ 0.0 ┆ 0.53 ┆ 0.22 ┆ MIT               │
│ ADT3_YEAST    ┆ 0.64 ┆ 0.62 ┆ 0.49 ┆ … ┆ 0.0 ┆ 0.53 ┆ 0.22 ┆ MIT               │
│ AAR2_YEAST    ┆ 0.58 ┆ 0.44 ┆ 0.57 ┆ … ┆ 0.0 ┆ 0.54 ┆ 0.22 ┆ NUC               │
│ AATM_YEAST    ┆ 0.42 ┆ 0.44 ┆ 0.48 ┆ … ┆ 0.0 ┆ 0.48 ┆ 0.22 ┆ MIT               │
│ …             ┆ …    ┆ …    ┆ …    ┆ … ┆ …   ┆ …    ┆ …    ┆ …                 │
│ YUR1_YEAST    ┆ 0.81 ┆ 0.62 ┆ 0.43 ┆ … ┆ 0.0 ┆ 0.53 ┆ 0.22 ┆ ME2  