# DSC540 Project 4

DePaul University  
Ilyas Ustun, PhD  
Chicago, IL  

## Rules
- Write your code under the corresponding question, where you see `# Code here`. You are encouraged to use more than one cell.
- Provide explanations in a separate Markdown-formatted cell.
- You can change the cell type by:
    - Clicking on the outer area of the cell you want to change,
    - Going to the top and selecting either Code or Markdown from the dropdown menu.
- Be concise in your explanations and conclusions.
- Write clear code and explain the functions you create using `#` comments.
- For built-in functions and library methods you use, provide a very brief explanation of what they do.
- Try to answer the questions by yourself. Use the documentation for pandas, sklearn, and similar libraries to solve the problems.
- If you are stuck, you can use other resources. Do not find an identical project and copy and paste the solutions.
- Write your name before beginning to code.


Important:  
- **Do NOT share the solutions with other people.**
- **Do NOT share the solutions on the internet including but not limited to Github and other platforms.**
- Sign the Honor Pledge below indicating that you have agreed to these rules listed here, and any other ethical and honor rules as required by the university.



- **Deliverables:**
    1. The Python Jupyter notebook file named properly with your name. Example: dsc540_project1_john_doe.ipynb
    2. The HTML output of this code notebook, named the same way. Example: dsc540_project1_john_doe.html
        - File -> Download as -> HTML   

Good Luck!

### Nihar Muniraju

**Honor Pledge:**  
I pledge on my honor that I, **Nihar Muniraju**, have followed the rules listed above, that I have not given or received any unauthorized assistance on this assignment. 


## Q1 [10]

A drug company would like to introduce a drug to help patients with Alzheimer's. The goal is to estimate $\theta$, the proportion of the market that this drug will capture.
- The company interviews 100 people, and 15 of them say that they will buy the drug. (This is the observed data: n=100, observed=15) $\rightarrow$ likelihood
- If new drugs have in the past tended to capture between 0.10 and 0.40 of the market, and all values in that range are assumed equally likely, then $\theta \sim Unif(0.10, 0.40)$ $\rightarrow$ prior
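
Combining the two ingredients above, the posterior can be written down up to a normalizing constant. The following is a short derivation sketch; the indicator notation $\mathbb{1}[\cdot]$ is introduced here only for illustration and is not part of the original problem statement.

$$
p(\theta \mid x = 15) \;\propto\; \underbrace{\binom{100}{15}\,\theta^{15}(1-\theta)^{85}}_{\text{likelihood}} \cdot \underbrace{\frac{1}{0.40 - 0.10}\,\mathbb{1}[0.10 \le \theta \le 0.40]}_{\text{prior}} \;\propto\; \theta^{15}(1-\theta)^{85}\,\mathbb{1}[0.10 \le \theta \le 0.40]
$$

In other words, the posterior has the shape of a Beta(16, 86) density restricted to the interval $[0.10, 0.40]$, so all sampled values of $\theta$ should fall inside that interval.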

### What is the traceplot of $\theta$ using Bayesian analysis to estimate the market share for the new drug?
- Sample 10000 draws
- Plot the traceplot

In [2]:
!pip install --upgrade theano numpy




In [3]:
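import os
import shutil
import warnings

import theano  # must be imported before theano.config is accessed below
import pymc3 as pm
import arviz as az
import mkl

# Clear Theano's compile cache so stale compiled modules are not reused
compiledir = theano.config.compiledir
if os.path.exists(compiledir):
    shutil.rmtree(compiledir)
os.makedirs(compiledir)
theano.config.compiledir = 'tmp'

# Silence library warnings
warnings.filterwarnings(action='ignore')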


You can find the C code in this temporary file: C:\Users\nihar\AppData\Local\Temp\theano_compilation_error_uhlg_qeh


AttributeError: partially initialized module 'theano' has no attribute 'gof' (most likely due to a circular import)

In [None]:
with pm.Model() as model:
    
    # Priors for unknown model parameters
    theta = pm.Uniform('theta', lower=0.10, upper=0.40)
    
    # Likelihood
    x = pm.Binomial('x', n=100, p=theta, observed=15)

    # Posterior
    # draw 10000 posterior samples
    trace = pm.sample(10000, return_inferencedata=False)
    
    # Plot the trace plot
    az.plot_trace(trace)

### Plot the posterior distribution plot of $ \theta $
- What is the mean posterior value?

In [None]:
# Plot the posterior of trace
az.____(____);
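
As a cross-check on the sampler that does not depend on PyMC3 or Theano, the posterior mean can also be approximated on a grid. This is a minimal sketch using only NumPy; the grid size and the variable names (`theta_grid`, `weights`, `posterior_mean`) are assumptions made for illustration, and it is not a substitute for the posterior plot the question asks for.

```python
import numpy as np

# Observed data and prior bounds taken from the Q1 problem statement
n_trials, n_buyers = 100, 15
lower, upper = 0.10, 0.40

# Evaluate the unnormalized log-posterior on a fine grid over the prior support.
# The uniform prior is constant on [lower, upper], so only the binomial
# likelihood term matters here.
theta_grid = np.linspace(lower, upper, 3001)
log_post = n_buyers * np.log(theta_grid) + (n_trials - n_buyers) * np.log1p(-theta_grid)

# Normalize to grid weights (subtracting the max avoids numerical underflow)
weights = np.exp(log_post - log_post.max())
weights /= weights.sum()

# Posterior mean as a weighted average over the grid
posterior_mean = float(np.sum(theta_grid * weights))
print(f"Grid-approximated posterior mean of theta: {posterior_mean:.4f}")
```

The mean reported by the MCMC posterior plot should be close to this value; a large disagreement would suggest a problem with the model definition or the sampling run.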


## Bayesian Network Analysis

In this part we will analyze medical diagnosis using Bayes Nets. The structure and the Conditional Probability Distribution (CPD) tables are shown in the figure below.

![MedicalDiagnosis](Med-diag-bnet.jpg)

- In the first few questions you will build the Bayes Net, set up the Conditional Probability Distribution tables, and associate the CPDs with the network.   
- These steps are crucial. Make sure you do the setup correctly, as everything else depends on it.   

## Import Libraries

**Import the usual libraries for pandas and plotting, and sklearn.**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import metrics

In [None]:
# import pgmpy
from pgmpy.models import BayesianModel
from pgmpy.factors.discrete import TabularCPD

In [None]:
import sklearn
sklearn.__version__

### These are the packages I used

In [None]:
# Your package imports here

## Q2 [10]

**Define the model structure.** 

You need to define the network by passing a list of edges. 

In [None]:
model = BayesianModel([
    ('Smokes', 'LungDisease'),
    ('LungDisease', 'ShortnessBreath'),
    ('LungDisease', 'ChestPain'),
    ('LungDisease', 'Cough'),
    ('Cold', 'Cough'),
    ('Cold', 'Fever'),
])

In [None]:
# Your code
# model = BayesianModel([('Smokes', 'LungDisease'),.........])
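
Before attaching any CPDs, it can help to confirm that the edge list encodes the intended parent/child relationships. The sketch below uses plain Python rather than pgmpy; the names `edges`, `parents`, `children`, and `roots` are illustrative assumptions, and the edge list is copied from the model definition in this question.

```python
from collections import defaultdict

# Edge list copied from the model definition: (parent, child)
edges = [
    ("Smokes", "LungDisease"),
    ("LungDisease", "ShortnessBreath"),
    ("LungDisease", "ChestPain"),
    ("LungDisease", "Cough"),
    ("Cold", "Cough"),
    ("Cold", "Fever"),
]

parents = defaultdict(set)   # node -> set of its parents
children = defaultdict(set)  # node -> set of its children
nodes = set()
for parent, child in edges:
    parents[child].add(parent)
    children[parent].add(child)
    nodes.update((parent, child))

# Root nodes have no parents, so their CPDs need no evidence columns
roots = sorted(node for node in nodes if not parents[node])

for node in sorted(nodes):
    print(f"{node}: parents={sorted(parents[node])}")
print("Roots:", roots)
```

Each node's CPD must list exactly the parents printed here as its `evidence`, which makes this a quick sanity check before Q3.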

## Q3 [15] 

**Define individual CPDs**
- Define the CPDs using the state names of the variables. 

In [None]:
cpd_Smokes = TabularCPD(variable='Smokes', variable_card=2, values=[[0.2], [0.8]], state_names={'Smokes' : ['T', 'F']})
print(cpd_Smokes)


cpd_LungDisease = TabularCPD(variable='LungDisease',
                             variable_card=2,
                             values=[[0.1009, 0.001],
                                     [0.8991, 0.999]],
                             evidence=['Smokes'],
                             evidence_card=[2],
                             state_names={'Smokes': ['T', 'F'], 'LungDisease': ['T', 'F']})
print(cpd_LungDisease)



In [None]:
# Your code

## Q4 [5] 
- Check that each of the CPDs is correct

In [None]:
cpd_Smokes

In [None]:
print(cpd_Smokes)

In [None]:
# Your code
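
One property every CPD must satisfy is that each column (one column per combination of parent states) sums to 1. The following is a minimal NumPy sketch of that check, applied to the two tables defined in Q3; the helper name `columns_sum_to_one` is hypothetical and not part of pgmpy.

```python
import numpy as np

def columns_sum_to_one(values, tol=1e-9):
    """Return True if every column of a CPD value table sums to 1 (within tol)."""
    table = np.asarray(values, dtype=float)
    return bool(np.allclose(table.sum(axis=0), 1.0, rtol=0.0, atol=tol))

# Value tables copied from the Q3 definitions (rows: T, F)
smokes_values = [[0.2], [0.8]]
lung_disease_values = [[0.1009, 0.001],
                       [0.8991, 0.999]]

print("Smokes CPD valid:", columns_sum_to_one(smokes_values))
print("LungDisease CPD valid:", columns_sum_to_one(lung_disease_values))
```

This only catches normalization mistakes; the individual entries still need to be compared against the figure by eye.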

## Q5 [5]
- Add the defined CPDs to the model (Associate the CPDs with the network)

In [None]:
model.add_cpds(.......)
model.check_model()

## Q6 [5]
- The cardinality of each of the nodes is 2 - True or False?

In [None]:
# Your code

### Q7 [10] 
**Find the probability of each event happening using the variable elimination method.**
- $ P(Smokes) $
- $ P(Cold) $
- $ P(LungDisease) $
- $ P(ShortnessBreath) $
- $ P(ChestPain) $
- $ P(Fever) $
- $ P(Cough) $
- $ P(LungDisease|Smokes=True) $
- $ P(LungDisease|Cough=True) $
- $ P(ShortnessBreath|Smokes=True) $
- $ P(ChestPain|Fever=True) $

In [None]:
from pgmpy.inference import VariableElimination
infer = VariableElimination(model)

In [None]:
dist = infer.query(['Smokes'])
print(dist)
# Your code
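
Some of these queries are small enough to verify by hand, which is a useful check on the inference output. For example, $P(LungDisease)$ follows from marginalizing over Smokes: $0.2 \cdot 0.1009 + 0.8 \cdot 0.001 = 0.02098$. The sketch below reproduces that sum with NumPy, using only the two CPDs defined in Q3; the array names are assumptions made for illustration.

```python
import numpy as np

# P(Smokes), ordered as [T, F] (from cpd_Smokes in Q3)
p_smokes = np.array([0.2, 0.8])

# P(LungDisease | Smokes): rows are LungDisease [T, F], columns are Smokes [T, F]
p_ld_given_smokes = np.array([[0.1009, 0.001],
                              [0.8991, 0.999]])

# Marginalize out Smokes: P(LD) = sum_s P(LD | Smokes=s) * P(Smokes=s)
p_ld = p_ld_given_smokes @ p_smokes

# Conditioning on Smokes=T simply selects the first column of the CPD
p_ld_given_smokes_true = p_ld_given_smokes[:, 0]

print("P(LungDisease) [T, F]:", p_ld)
print("P(LungDisease | Smokes=T) [T, F]:", p_ld_given_smokes_true)
```

The values returned by `infer.query(['LungDisease'])` and `infer.query(['LungDisease'], evidence={'Smokes': 'T'})` should agree with these hand-computed numbers.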

## Q8 [15] 
**Are the following statements true? If not, can the variables be made independent by conditioning on a parent?**
1. Cough is independent from Fever. (Having knowledge about Fever does not change the probability of Cough) 
2. Fever is independent from Smokes. 
3. ChestPain is independent from Smokes.
4. ChestPain is independent from Smokes given LungDisease.


##### 1. Cough is independent from Fever

In [None]:
dist = infer.query(['Cough'])
print(dist)

dist = infer.query(['Cough'], evidence={'Fever':'T'})
print(dist)

dist = infer.query(['Cough'], evidence={'Fever':'F'})
print(dist)

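- Knowing Fever changes the distribution of Cough, so the two are not independent.
- However, given their common parent (Cold), they should be independent: for a fixed value of Cold, the distribution of Cough should be the same whether Fever is T or F.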

In [None]:
dist = infer.query(['Cough'], evidence={'Cold':'F', 'Fever':'T'})
print(dist)

dist = infer.query(['Cough'], evidence={'Cold':'F', 'Fever':'F'})
print(dist)

dist = infer.query(['Cough'], evidence={'Cold':'T', 'Fever':'T'})
print(dist)

dist = infer.query(['Cough'], evidence={'Cold':'T', 'Fever':'F'})
print(dist)

##### 2. Fever is independent from Smokes. 

In [None]:
# Your code

##### 3. ChestPain is independent from Smokes.

In [None]:
# Your code
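
The independence statements in this question can also be checked from the graph structure alone, without any probabilities, using d-separation. The following is a sketch of the standard moralized-ancestral-graph test in plain Python; the function names `parents_of` and `d_separated` are hypothetical helpers written for this illustration (they are not pgmpy APIs), and the edge list is the one from Q2.

```python
from itertools import combinations

# Edge list from Q2: (parent, child)
EDGES = [
    ("Smokes", "LungDisease"),
    ("LungDisease", "ShortnessBreath"),
    ("LungDisease", "ChestPain"),
    ("LungDisease", "Cough"),
    ("Cold", "Cough"),
    ("Cold", "Fever"),
]

def parents_of(edges):
    """Map every node to the set of its parents."""
    parents = {}
    for parent, child in edges:
        parents.setdefault(parent, set())
        parents.setdefault(child, set()).add(parent)
    return parents

def d_separated(edges, x, y, given=()):
    """Return True if x and y are d-separated given the nodes in `given`."""
    parents = parents_of(edges)

    # 1. Keep only x, y, the conditioning set, and all of their ancestors
    keep, stack = set(), [x, y, *given]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents[node])

    # 2. Moralize: link each child to its parents, and link co-parents together
    neighbours = {node: set() for node in keep}
    for child in keep:
        for parent in parents[child]:
            neighbours[child].add(parent)
            neighbours[parent].add(child)
        for a, b in combinations(parents[child], 2):
            neighbours[a].add(b)
            neighbours[b].add(a)

    # 3. Search for an undirected path from x to y that avoids the given nodes
    blocked, seen, stack = set(given), set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False  # a path exists, so x and y are d-connected
        if node in seen or node in blocked:
            continue
        seen.add(node)
        stack.extend(neighbours[node])
    return True

print("Cough vs Fever:", d_separated(EDGES, "Cough", "Fever"))
print("Cough vs Fever | Cold:", d_separated(EDGES, "Cough", "Fever", given=["Cold"]))
print("Fever vs Smokes:", d_separated(EDGES, "Fever", "Smokes"))
print("ChestPain vs Smokes:", d_separated(EDGES, "ChestPain", "Smokes"))
print("ChestPain vs Smokes | LungDisease:",
      d_separated(EDGES, "ChestPain", "Smokes", given=["LungDisease"]))
```

A `True` result means the structure guarantees independence for any choice of CPDs, so the corresponding `infer.query` results with and without the extra evidence should match. A `False` result means only that dependence is possible, which the numerical queries then confirm or rule out.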

## Q9 [20]
1. Does having the knowledge that the person is coughing increase the probability of lung disease?
2. Does having the knowledge that the person is coughing increase the probability that the person has Cold?
3. Does having the knowledge that the person is coughing increase the probability that there is Fever?
4. Does having the knowledge that the person is coughing increase the probability that there is Fever, given the person has Cold?
5. Does having the knowledge that the person has Fever increase the probability that there is Lung Disease?


##### 1. Does having the knowledge that the person is coughing increase the probability of lung disease? [5]

In [None]:
dist = infer.query(['LungDisease'])
print(dist)
dist = infer.query(['LungDisease'], evidence={'Cough': 'T'})
print(dist)

Having the knowledge that the person is coughing increases the probability of lung disease.

##### 2. Does having the knowledge that the person is coughing increase the probability that the person has Cold? [5]

In [None]:
# Your code

##### 3. Does having the knowledge that the person is coughing increase the probability that there is Fever? [5]

In [None]:
# Your code

##### 4. Does having the knowledge that the person is coughing increase the probability that there is Fever, given the person has Cold?

In [None]:
# Your code

##### 5. Does having the knowledge that the person has Fever increase the probability that there is Lung Disease?

In [None]:
# Your code

## Q10 [5]
1. What's the most probable state of Cough? 
2. What's the most probable state of Cough given Cold is True? 
3. What's the most probable state of Cough given Cold is True and Lung Disease is True?

##### 1. What's the most probable state of Cough? 

In [None]:
infer.map_query(['Cough'])

The most probable state of Cough is False.

##### 2. What's the most probable state of Cough given Cold is True? 

In [None]:
# Your code

##### 3. What's the most probable state of Cough given Cold is True and Lung Disease is True?

In [None]:
# Your code

# Well done!