# Database description

Taken from [here](http://lisp.vse.cz/pkdd99/Challenge/tsumoto.htm).

## Domain

The database was collected at Chiba University hospital. Each patient came to the hospital's outpatient clinic for collagen diseases on the recommendation of a home doctor or a general physician at a local hospital.

Collagen diseases are autoimmune diseases: patients generate antibodies that attack their own bodies. For example, a patient who generates antibodies in the lungs will chronically lose respiratory function and eventually die. The disease mechanisms are only partially known and their classification is still fuzzy. Some patients generate many kinds of antibodies, and their manifestations may include all the characteristics of collagen diseases.

In collagen diseases, thrombosis is one of the most important and severe complications and one of the major causes of death. Thrombosis is an increased coagulation of blood that clogs blood vessels. It usually lasts several hours and can recur over time. Thrombosis can arise from different collagen diseases, and it has been found to be closely related to anti-cardiolipin antibodies. This was discovered by physicians, one of whom donated the datasets for the discovery challenge.

Thrombosis must be treated as an emergency, so it is important to detect and predict its occurrence. However, no immunology expert has yet analyzed a database like this one. Domain experts are very interested in discovering regularities behind patients' observations.

## Goals

1. Search for patterns which detect and predict thrombosis.
2. Search for temporal patterns specific/sensitive to thrombosis. (The examination date is very close to the date of thrombosis. Specific/sensitive patterns found before or after the thrombosis would be very useful.)
3. Search for features that classify collagen diseases correctly.
4. Search for temporal patterns specific/sensitive to each collagen disease.


<p>&nbsp;</p>
<H3>TSUM_A.CSV</H3>
<p>Basic information about patients (input by doctors).
This dataset includes all patients (about 1000 records).</p>

<TABLE CELLSPACING=0 BORDER CELLPADDING=4 WIDTH=85%>
<TR><TH>item</TH><TH>meaning</TH><TH>remark</TH></TR>
<TR><TD>ID</TD><TD>identification of the patient</TD><TD></TD></TR>
<TR><TD>Sex</TD><TD></TD><TD></TD></TR>
<TR><TD>Birthday</TD><TD></TD><TD>YYYY/M/D</TD></TR>
<TR><TD>Description date</TD><TD>the date when the patient's data was first
recorded</TD><TD>YY.MM.DD</TD></TR>
<TR><TD>First date</TD><TD>the date when a patient came to the hospital</TD>
<TD>YY.MM.DD</TD></TR>
<TR><TD>Admission</TD><TD>patient was admitted to the hospital (+) or followed
at the outpatient clinic (-)</TD><TD></TD></TR>
<TR><TD>Diagnosis</TD><TD>
disease names</TD><TD>multivalued attribute</TD></TR>
</TABLE>
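
The table above lists two different date formats (`YYYY/M/D` for the birthday, `YY.MM.DD` for the description and first dates), so a cleaning step will probably need to normalize them. The following is a minimal sketch, not a definitive implementation: the function names are hypothetical, the sample values are made up, and it assumes every two-digit year belongs to the 1900s.

```python
from datetime import date


def parse_birthday(text):
    """Parse a 'YYYY/M/D' string; return None for an empty field."""
    text = text.strip()
    if not text:
        return None
    year, month, day = (int(part) for part in text.split("/"))
    return date(year, month, day)


def parse_short_date(text):
    """Parse a 'YY.MM.DD' string; return None for an empty field.

    Assumption: all two-digit years mean 19YY.
    """
    text = text.strip()
    if not text:
        return None
    yy, month, day = (int(part) for part in text.split("."))
    return date(1900 + yy, month, day)


# Made-up sample values, only to show the two formats.
print(parse_birthday("1934/2/13"))
print(parse_short_date("94.02.14"))
```

Returning `None` for empty fields keeps missing dates explicit instead of raising an error halfway through a file.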

<p>&nbsp;</p>
<H3>TSUM_B.CSV</H3>
<p>Special laboratory examinations (input by doctors),
measured by the Laboratory on Collagen Diseases. This
dataset does not include all the patients,
only those who had these special tests.</p>

<TABLE CELLSPACING=0 BORDER CELLPADDING=4 WIDTH=85%>
<TR><TH>item</TH><TH>meaning</TH><TH>remark</TH></TR>
<TR><TD>ID</TD><TD>identification of the patient</TD><TD></TD></TR>
<TR><TD>Examination Date</TD><TD>date of the test</TD><TD>YYYY/MM/DD</TD></TR>
<TR><TD>aCL IgG</TD><TD>anti-Cardiolipin antibody (IgG) concentration</TD><TD></TD></TR>
<TR><TD>aCL IgM</TD><TD>anti-Cardiolipin antibody (IgM) concentration</TD><TD></TD></TR>
<TR><TD>ANA</TD><TD>anti-nuclear antibody concentration</TD><TD></TD></TR>
<TR><TD>ANA Pattern</TD><TD> pattern observed in the sheet of ANA
examination</TD><TD></TD></TR>
<TR><TD>aCL IgA</TD><TD>anti-Cardiolipin antibody (IgA) concentration</TD><TD></TD></TR>
<TR><TD>Diagnosis</TD><TD>disease names</TD><TD>multivalued attribute</TD></TR>
<TR><TD>KCT</TD><TD>measure of degree of coagulation</TD><TD></TD></TR>
<TR><TD>RVVT</TD><TD>measure of degree of coagulation</TD><TD></TD></TR>
<TR><TD>LAC</TD><TD>measure of degree of coagulation</TD><TD></TD></TR>
<TR><TD>Symptoms</TD><TD>other symptoms observed 
</TD><TD>multivalued attribute</TD></TR>
<TR><TD>Thrombosis</TD><TD>degree of thrombosis</TD>
<TD>0: negative (no thrombosis) <br> 1: positive (the most severe one)<br> 2: positive
(severe) <br>
3: positive (mild)</TD></TR>
</TABLE>
<p>The examination date is very close to the date of thrombosis. For negative examples, these tests
were performed when thrombosis was suspected.</p>
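
Note that the `Thrombosis` codes are not ordered by severity: 1 is the most severe and 3 is mild. A recoding step could therefore add a binary target and a monotonic severity rank. The sketch below uses only the coding given in the table; the names `recode_thrombosis` and `SEVERITY_RANK` are hypothetical, and the rank values are one possible choice rather than part of the dataset.

```python
# Labels copied from the table above.
THROMBOSIS_LABELS = {
    0: "negative (no thrombosis)",
    1: "positive (the most severe one)",
    2: "positive (severe)",
    3: "positive (mild)",
}

# Hypothetical rank so that a larger number means a more severe case.
SEVERITY_RANK = {0: 0, 3: 1, 2: 2, 1: 3}


def recode_thrombosis(raw):
    """Turn a raw Thrombosis field into label, binary flag and rank."""
    code = int(raw)
    if code not in THROMBOSIS_LABELS:
        raise ValueError(f"unexpected Thrombosis code: {raw!r}")
    return {
        "code": code,
        "label": THROMBOSIS_LABELS[code],
        "positive": code != 0,
        "severity": SEVERITY_RANK[code],
    }


print(recode_thrombosis("1"))
print(recode_thrombosis("0"))
```

The binary `positive` flag suits the detection goal, while `severity` can be used when the degree of thrombosis matters.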

<p>&nbsp;</p>
<H3>TSUM_C.CSV</H3>
<p>Laboratory examinations stored in the Hospital Information System
(from 1980 to March 1999). The data cover ordinary laboratory examinations
and are time-stamped. The tests are not necessarily connected to
thrombosis.</p>


<TABLE CELLSPACING=0 BORDER CELLPADDING=4 WIDTH=85%>
<TR><TH>item</TH><TH>meaning</TH><TH>normal range</TH></TR>
<TR><TD>ID</TD><TD>identification of the patient</TD><TD></TD></TR>
<TR><TD>Date</TD><TD>Date of the laboratory tests (YYMMDD)</TD><TD></TD></TR>
<TR><TD>GOT</TD><TD>AST glutamic oxaloacetic transaminase</TD><TD>N < 60</TD></TR>
<TR><TD>GPT</TD><TD>ALT glutamic pyruvic transaminase</TD><TD>N < 60</TD></TR>
<TR><TD>LDH</TD><TD>lactate dehydrogenase</TD><TD>N < 500</TD></TR>
<TR><TD>ALP</TD><TD>alkaline phosphatase</TD><TD>N < 300</TD></TR>
<TR><TD>TP</TD><TD>total protein</TD><TD>6.0 < N < 8.5</TD></TR>
<TR><TD>ALB</TD><TD>albumin</TD><TD>3.5 < N < 5.5</TD></TR>
<TR><TD>UA</TD><TD>uric acid</TD><TD>N > 8.0 (Male) <br> N > 6.5 (Female)</TD></TR>
<TR><TD>UN</TD><TD>urea nitrogen</TD><TD>N < 30</TD></TR>
<TR><TD>CRE</TD><TD>creatinine</TD><TD>N < 1.5</TD></TR>
<TR><TD>T-BIL</TD><TD>total bilirubin</TD><TD>N < 2.0</TD></TR>
<TR><TD>T-CHO</TD><TD>total cholesterol</TD><TD>N < 250</TD></TR>
<TR><TD>TG</TD><TD>triglyceride</TD><TD>N < 200</TD></TR>
<TR><TD>CPK</TD><TD>creatine phosphokinase</TD><TD>N < 250</TD></TR>
<TR><TD>GLU</TD><TD>blood glucose</TD><TD>N < 180</TD></TR>
<TR><TD>WBC</TD><TD>White blood cell</TD><TD>3.5 < N < 9.0</TD></TR>
<TR><TD>RBC</TD><TD>Red blood cell</TD><TD>3.5 < N < 6.0</TD></TR>
<TR><TD>HGB</TD><TD>Hemoglobin</TD><TD>10 < N < 17</TD></TR>
<TR><TD>HCT</TD><TD>Hematocrit</TD><TD>29 < N < 52</TD></TR>
<TR><TD>PLT</TD><TD>platelet</TD><TD>100 < N < 400</TD></TR>
<TR><TD>PT</TD><TD>prothrombin time</TD><TD>N < 14</TD></TR>
<TR><TD>Note</TD><TD>comment for the test PT</TD><TD></TD></TR>
<TR><TD>APTT</TD><TD>activated partial thromboplastin time</TD><TD>N < 45</TD></TR>
<TR><TD>FG</TD><TD>fibrinogen</TD><TD>150 < N < 450</TD></TR>
<TR><TD>AT3</TD><TD>marker of DIC, one of the most important
complications
of collagen diseases</TD><TD>70 < N < 130</TD></TR>
<TR><TD>A2PI</TD><TD>marker of DIC</TD><TD>70 < N < 130</TD></TR>
<TR><TD>U-PRO</TD><TD>proteinuria</TD><TD>0 < N < 30</TD></TR>
<TR><TD>IGG</TD><TD>Ig G</TD><TD>900 < N < 2000</TD></TR>
<TR><TD>IGA</TD><TD>Ig A</TD><TD>80 < N < 500</TD></TR>
<TR><TD>IGM</TD><TD>Ig M</TD><TD>40 < N < 400</TD></TR>
<TR><TD>CRP</TD><TD>C-reactive protein</TD><TD>N= -, +-,   or  N < 1.0</TD></TR>
<TR><TD>RA</TD><TD>Rheumatoid Factor</TD><TD>N= -, +-</TD></TR>
<TR><TD>RF</TD><TD>RAHA</TD><TD>N < 20</TD></TR>
<TR><TD>C3</TD><TD>complement 3</TD><TD>N > 35</TD></TR>
<TR><TD>C4</TD><TD>complement 4</TD><TD>N > 10</TD></TR>
<TR><TD>RNP</TD><TD>anti-ribonucleoprotein</TD><TD>N= -, +-</TD></TR>
<TR><TD>SM</TD><TD>anti-SM</TD><TD>N= -, +-</TD></TR>
<TR><TD>SCl70</TD><TD>anti-scl70</TD><TD>N= -, +-</TD></TR>
<TR><TD>SSA</TD><TD>anti-SSA</TD><TD>N= -, +-</TD></TR>
<TR><TD>SSB</TD><TD>anti-SSB</TD><TD>N= -, +-</TD></TR>
<TR><TD>CENTROMEA</TD><TD>anti-centromere</TD><TD>N= -, +-</TD></TR>
<TR><TD>DNA</TD><TD>anti-DNA</TD><TD>N < 8</TD></TR>
<TR><TD>DNA-II</TD><TD>anti-DNA</TD><TD>N < 8</TD></TR>
</TABLE>
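
The "normal range" column can be turned into derived variables, for example a flag saying whether a measurement falls inside its range. The sketch below is only illustrative: it encodes four of the numeric ranges exactly as listed in the table (with strict inequalities), the names `NORMAL_RANGES` and `is_normal` are hypothetical, and the sample measurements are made up.

```python
# (lower bound, upper bound); None means the table gives no bound on that side.
NORMAL_RANGES = {
    "GOT": (None, 60),   # N < 60
    "TP": (6.0, 8.5),    # 6.0 < N < 8.5
    "PLT": (100, 400),   # 100 < N < 400
    "C3": (35, None),    # N > 35
}


def is_normal(test, value):
    """Return True if value lies strictly inside the listed normal range."""
    low, high = NORMAL_RANGES[test]
    if low is not None and value <= low:
        return False
    if high is not None and value >= high:
        return False
    return True


# Made-up measurements for a single examination.
sample = {"GOT": 34, "TP": 5.2, "PLT": 227, "C3": 20}
flags = {test: is_normal(test, value) for test, value in sample.items()}
print(flags)
```

Tests whose normal values are symbols such as `-` or `+-` (for example `RA` or `SSA`) would need a separate categorical check, which this sketch does not cover.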


# Exercise

Using what you learned in the introduction, do the following:

1. Read the files and load each one into a separate `RDD`.
2. Clean the _datasets_ (create new variables, recode values, etc.).
3. Review the tables above. Can you imagine having a single data source?
4. Register each `RDD` as a temporary table.
5. Run `queries` to check that everything loaded correctly.
6. Join the tables into a new one using `SQL`. Can you think of another way to do it?
7. Brainstorm ideas for analysis.
8. Brainstorm ideas for visualization.